[RFC] diff: add `diff.<driver>.process` for external hunk providers #2120

gitgitgadget · 2026-05-26T05:46:31Z

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > +struct diff_subprocess { > + struct subprocess_entry subprocess; > + unsigned int supported_capabilities; > +}; > + > +static int subprocess_map_initialized; > +static struct hashmap subprocess_map; Can we avoid introducing new global variables like these? Would "struct userdiff_driver" or "struct diff_options" be a good place to hang this hashmap, perhaps? > +static int send_file_content(int fd, const char *buf, long size) > +{ > + int ret; > + > + if (size > 0) > + ret = write_packetized_from_buf_no_flush(buf, size, fd); > + else > + ret = 0; Shouldn't "size == -24" be flagged as an invalid input? > + if (ret) > + return ret; > + return packet_flush_gently(fd); > +} > +static int parse_hunk_line(const char *line, struct xdl_hunk *hunk) > +{ > +... > +} This gives a silent error diagnosis, which is good for a lower level helper. > +int diff_process_get_hunks(struct userdiff_driver *drv, > + const char *path, > + const char *old_buf, long old_size, > + const char *new_buf, long new_size, > + struct xdl_hunk **hunks_out, > + size_t *nr_hunks_out) > +{ > + struct diff_subprocess *backend; > + struct child_process *process; > + int fd_in, fd_out; > + struct strbuf status = STRBUF_INIT; > + struct xdl_hunk *hunks = NULL; > + struct xdl_hunk hunk; > + size_t nr_hunks = 0, alloc_hunks = 0; > + int len; > + char *line; > + > + if (!drv || !drv->process) > + return -1; A driver that does not define process is not an error; it is perfectly normal in the current world order where nobody has such an external process and even fi this patch lands, external processes are optional. So here "return -1" does not mean an error, and silent return is perfectly fine. > + backend = find_or_start_process(drv->process); > + if (!backend) > + return -1; This is probably an error; the user specified drv->process, we either tried to find or start the process and failed. Isn't it an event that deserves to be reported in an error message? > + if (!(backend->supported_capabilities & CAP_HUNKS)) > + return -1; Backend started, but the "hunks" feature is not supported. Perhaps in a year or two, this external process protocol may have become so popular that it gained more capabilities, possibly making get_hunks obsolete. We may be looking at such an external process that uses other capabilities but not this one. This is not an error, so silent return is perfectly fine. > + process = subprocess_get_child_process(&backend->subprocess); > + fd_in = process->in; > + fd_out = process->out; > + > + /* Send request */ > + if (packet_write_fmt_gently(fd_in, "command=hunks\n") || > + packet_write_fmt_gently(fd_in, "pathname=%s\n", path) || > + packet_flush_gently(fd_in)) > + goto error; > + > + /* Send old file content */ > + if (send_file_content(fd_in, old_buf, old_size)) > + goto error; > + > + /* Send new file content */ > + if (send_file_content(fd_in, new_buf, new_size)) > + goto error; > + > + /* Read hunks until flush packet */ > + while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 && > + line) { > + if (parse_hunk_line(line, &hunk) < 0) > + goto error; > + ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks); > + hunks[nr_hunks++] = hunk; > + } > + if (len < 0) > + goto error; > + > + /* Read status */ > + if (subprocess_read_status(fd_out, &status)) > + goto error; > + > + if (strcmp(status.buf, "success")) { > + if (!strcmp(status.buf, "abort")) > + backend->supported_capabilities &= ~CAP_HUNKS; > + goto error; > + } > + > + *hunks_out = hunks; > + *nr_hunks_out = nr_hunks; > + strbuf_release(&status); > + return 0; > + > +error: All exceptions that lead here look like events that should be reported to the end-user. > + free(hunks); > + strbuf_release(&status); > + return -1; > +} > +/* > + * Query a diff process for hunks describing the changes > + * between old_buf and new_buf. > + * > + * The backend is a long-running subprocess configured via > + * diff.<driver>.process. It receives file content via > + * pkt-line and returns hunks with 1-based line numbers. > + * > + * On success, sets *hunks_out and *nr_hunks_out to a newly allocated > + * array (caller must free) and returns 0. > + * > + * On failure, returns -1. The caller should fall back to the > + * builtin diff algorithm. > + */ I do not agree with this. If it is a failure, the user should fix the external process (or disable). It shouldn't be hidden behind a fallback. As I left comments, in this round of implementation, there are conditions that returns -1 for soemthing that is not an error (i.e., not configured, or process not supporting the particular capability) *and* in those cases the caller should fall back as if nothing happened. But some error cases, the caller should't hide them.

Michael Montalbo wrote on the Git mailing list (how to reply to this email):

On Mon, May 25, 2026 at 6:56 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > +struct diff_subprocess { > > + struct subprocess_entry subprocess; > > + unsigned int supported_capabilities; > > +}; > > + > > +static int subprocess_map_initialized; > > +static struct hashmap subprocess_map; > > Can we avoid introducing new global variables like these? Would > "struct userdiff_driver" or "struct diff_options" be a good place to > hang this hashmap, perhaps? > Will clean this up. > > +static int send_file_content(int fd, const char *buf, long size) > > +{ > > + int ret; > > + > > + if (size > 0) > > + ret = write_packetized_from_buf_no_flush(buf, size, fd); > > + else > > + ret = 0; > > Shouldn't "size == -24" be flagged as an invalid input? > Will fix and do a broader audit of input validation and bounds checking. > > + if (ret) > > + return ret; > > + return packet_flush_gently(fd); > > +} > > > +static int parse_hunk_line(const char *line, struct xdl_hunk *hunk) > > +{ > > +... > > +} > > This gives a silent error diagnosis, which is good for a lower level > helper. > > > +int diff_process_get_hunks(struct userdiff_driver *drv, > > + const char *path, > > + const char *old_buf, long old_size, > > + const char *new_buf, long new_size, > > + struct xdl_hunk **hunks_out, > > + size_t *nr_hunks_out) > > +{ > > + struct diff_subprocess *backend; > > + struct child_process *process; > > + int fd_in, fd_out; > > + struct strbuf status = STRBUF_INIT; > > + struct xdl_hunk *hunks = NULL; > > + struct xdl_hunk hunk; > > + size_t nr_hunks = 0, alloc_hunks = 0; > > + int len; > > + char *line; > > + > > + if (!drv || !drv->process) > > + return -1; > > A driver that does not define process is not an error; it is > perfectly normal in the current world order where nobody has such an > external process and even fi this patch lands, external processes > are optional. So here "return -1" does not mean an error, and > silent return is perfectly fine. > > > + backend = find_or_start_process(drv->process); > > + if (!backend) > > + return -1; > > This is probably an error; the user specified drv->process, we > either tried to find or start the process and failed. Isn't it an > event that deserves to be reported in an error message? > > > + if (!(backend->supported_capabilities & CAP_HUNKS)) > > + return -1; > > Backend started, but the "hunks" feature is not supported. Perhaps > in a year or two, this external process protocol may have become so > popular that it gained more capabilities, possibly making get_hunks > obsolete. We may be looking at such an external process that uses > other capabilities but not this one. This is not an error, so > silent return is perfectly fine. > > > + process = subprocess_get_child_process(&backend->subprocess); > > + fd_in = process->in; > > + fd_out = process->out; > > + > > + /* Send request */ > > + if (packet_write_fmt_gently(fd_in, "command=hunks\n") || > > + packet_write_fmt_gently(fd_in, "pathname=%s\n", path) || > > + packet_flush_gently(fd_in)) > > + goto error; > > + > > + /* Send old file content */ > > + if (send_file_content(fd_in, old_buf, old_size)) > > + goto error; > > + > > + /* Send new file content */ > > + if (send_file_content(fd_in, new_buf, new_size)) > > + goto error; > > + > > + /* Read hunks until flush packet */ > > + while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 && > > + line) { > > + if (parse_hunk_line(line, &hunk) < 0) > > + goto error; > > + ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks); > > + hunks[nr_hunks++] = hunk; > > + } > > + if (len < 0) > > + goto error; > > + > > + /* Read status */ > > + if (subprocess_read_status(fd_out, &status)) > > + goto error; > > + > > + if (strcmp(status.buf, "success")) { > > + if (!strcmp(status.buf, "abort")) > > + backend->supported_capabilities &= ~CAP_HUNKS; > > + goto error; > > + } > > + > > + *hunks_out = hunks; > > + *nr_hunks_out = nr_hunks; > > + strbuf_release(&status); > > + return 0; > > + > > +error: > > All exceptions that lead here look like events that should be > reported to the end-user. > Agreed on all points. I will restructure things so errors are flagged when appropriate (i.e., user specified a process but one was not found / couldn't start and exceptions) and non-errors are treated as they should be. > > + free(hunks); > > + strbuf_release(&status); > > + return -1; > > +} > > > +/* > > + * Query a diff process for hunks describing the changes > > + * between old_buf and new_buf. > > + * > > + * The backend is a long-running subprocess configured via > > + * diff.<driver>.process. It receives file content via > > + * pkt-line and returns hunks with 1-based line numbers. > > + * > > + * On success, sets *hunks_out and *nr_hunks_out to a newly allocated > > + * array (caller must free) and returns 0. > > + * > > + * On failure, returns -1. The caller should fall back to the > > + * builtin diff algorithm. > > + */ > > I do not agree with this. If it is a failure, the user should fix > the external process (or disable). It shouldn't be hidden behind a > fallback. As I left comments, in this round of implementation, > there are conditions that returns -1 for soemthing that is not an > error (i.e., not configured, or process not supporting the > particular capability) *and* in those cases the caller should fall > back as if nothing happened. But some error cases, the caller > should't hide them. Will address in a follow-up. Thank you for the feedback!

gitgitgadget · 2026-05-26T05:46:39Z

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > Zero hunks with status=success means the tool considers the > files equivalent. Git skips diff output for that file. Is "zero hunk" a common word or some random string you invented? If the latter, which is I am assuming it to be, you should define what it means at/before the first use. Here in the proposed log message, and ... > > Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> > --- > Documentation/config/diff.adoc | 8 + > Documentation/gitattributes.adoc | 40 ++++ > Makefile | 1 + > diff-process.c | 206 +++++++++++++++++++ > diff-process.h | 28 +++ > diff.c | 23 +++ > t/.gitattributes | 1 + > t/t4080-diff-process.sh | 338 +++++++++++++++++++++++++++++++ > 8 files changed, 645 insertions(+) > create mode 100644 diff-process.c > create mode 100644 diff-process.h > create mode 100755 t/t4080-diff-process.sh > > diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc > index 1135a62a0a..4ab5f60df6 100644 > --- a/Documentation/config/diff.adoc > +++ b/Documentation/config/diff.adoc > @@ -218,6 +218,14 @@ endif::git-diff[] > Set this option to `true` to make the diff driver cache the text > conversion outputs. See linkgit:gitattributes[5] for details. > > +`diff.<driver>.process`:: > + The command to run as a long-running diff process. > + The tool communicates via the pkt-line protocol and returns > + hunks that are fed into Git's diff and blame pipelines. > + If the tool returns zero hunks, the file is treated as > + unchanged for both diff output and blame attribution. > + See linkgit:gitattributes[5] for details. ... also here. I do not know if you mean "the tool returns no hunks" (there is no "hunk <old_start> <old_count> <new_start> <new_count>" line passed from the tool over the protocol) or "the tool returns zero-hunk" (there is a special "zero-hunk" message to signal this particular condition sent over the protocol), and this description does not quite help disambiguating between the two. If the former, then avoid "zero hunks" as it sounds like a noun with special meaning. Yes, we can say "tool returns one hunk", "tool returns 31 hunks", etc., so "tool returns zero hunks" may logically be correct, but "when the tool returns no hunks with status=success" is much less confusing, I think.

Michael Montalbo wrote on the Git mailing list (how to reply to this email):

On Mon, May 25, 2026 at 7:26 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > Zero hunks with status=success means the tool considers the > > files equivalent. Git skips diff output for that file. > > Is "zero hunk" a common word or some random string you invented? If > the latter, which is I am assuming it to be, you should define what > it means at/before the first use. Here in the proposed log message, > and ... > > > > > Signed-off-by: Michael Montalbo <mmontalbo@gmail.com> > > --- > > Documentation/config/diff.adoc | 8 + > > Documentation/gitattributes.adoc | 40 ++++ > > Makefile | 1 + > > diff-process.c | 206 +++++++++++++++++++ > > diff-process.h | 28 +++ > > diff.c | 23 +++ > > t/.gitattributes | 1 + > > t/t4080-diff-process.sh | 338 +++++++++++++++++++++++++++++++ > > 8 files changed, 645 insertions(+) > > create mode 100644 diff-process.c > > create mode 100644 diff-process.h > > create mode 100755 t/t4080-diff-process.sh > > > > diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc > > index 1135a62a0a..4ab5f60df6 100644 > > --- a/Documentation/config/diff.adoc > > +++ b/Documentation/config/diff.adoc > > @@ -218,6 +218,14 @@ endif::git-diff[] > > Set this option to `true` to make the diff driver cache the text > > conversion outputs. See linkgit:gitattributes[5] for details. > > > > +`diff.<driver>.process`:: > > + The command to run as a long-running diff process. > > + The tool communicates via the pkt-line protocol and returns > > + hunks that are fed into Git's diff and blame pipelines. > > + If the tool returns zero hunks, the file is treated as > > + unchanged for both diff output and blame attribution. > > + See linkgit:gitattributes[5] for details. > > ... also here. > > I do not know if you mean "the tool returns no hunks" (there is no > "hunk <old_start> <old_count> <new_start> <new_count>" line passed > from the tool over the protocol) or "the tool returns zero-hunk" > (there is a special "zero-hunk" message to signal this particular > condition sent over the protocol), and this description does not > quite help disambiguating between the two. > > If the former, then avoid "zero hunks" as it sounds like a noun with > special meaning. Yes, we can say "tool returns one hunk", "tool > returns 31 hunks", etc., so "tool returns zero hunks" may logically > be correct, but "when the tool returns no hunks with status=success" > is much less confusing, I think. Yes, "zero hunks" was my own invention and I see why it's confusing. Will update the messaging to use "no hunks" instead and do a broader sweep of the documentation to clarify the protocol and expected tool behavior.

gitgitgadget · 2026-06-07T14:55:48Z

Johannes Schindelin wrote on the Git mailing list (how to reply to this email):

Hi Michael, I stumbled about this patch when it broke CI in Git for Windows, where we do _not_ use `NO_PYTHON`, even though Python is unavailable in the build/test CI jobs. The existing tests handle this situation gracefully, this here patch does not: On Sun, 7 Jun 2026, Michael Montalbo via GitGitGadget wrote: > diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh > new file mode 100755 > index 0000000000..f159cd86d8 > --- /dev/null > +++ b/t/t4080-diff-process.sh > @@ -0,0 +1,538 @@ > +#!/bin/sh > + > +test_description='diff process via long-running process' > + > +. ./test-lib.sh > + > +if test_have_prereq PYTHON > +then > + PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python) When neither `python3` nor `python` are available (which is the case in the minimal Git for Windows SDK used in Git's CI runs), this fails under `set -e`. Before even running the first test case. Resulting in an unexpected TAP format error. Now, we could "fix" this by imitating what `lib-p4` does (see https://github.com/dscho/git/commit/bd0b5570c744f678911a67a62da63f30655f20d8 which demonstrates that it is indeed a work-around on Windows): -- snip -- diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh index fdf6da1c341e67..bd22c247ff3856 100755 --- a/t/t4080-diff-process.sh +++ b/t/t4080-diff-process.sh @@ -4,9 +4,10 @@ test_description='diff process via long-running process' . ./test-lib.sh -if test_have_prereq PYTHON +if ! test_have_prereq PYTHON || ! test -x "$PYTHON_PATH" then - PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python) + skip_all='python interpreter not available' + test_done fi # -- snap -- Of course, this uncovers _another_ problem with the Python script: It uses Python3-only `f"..."` format strings, which cannot be handled by the Python2 to which the `PYTHON_PATH` variable in `linux-TEST-vars` points. So this requires _another follow-up (see also https://github.com/dscho/git/commit/c12a9f4c80e5ce8db0fe370fac46fb45be2b775f): -- snip -- diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh index bd22c247ff3856..ba14682a9086e4 100755 --- a/t/t4080-diff-process.sh +++ b/t/t4080-diff-process.sh @@ -39,7 +39,8 @@ setup_backend () { def write_pkt(line): data = (line + "\n").encode() - sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data) + hdr = "{:04x}".format(len(data) + 4).encode() + sys.stdout.buffer.write(hdr + data) sys.stdout.buffer.flush() def write_flush(): @@ -98,7 +99,8 @@ setup_backend () { new = read_content() old_first = old.split(b"\n")[0].decode(errors="replace") if old else "" new_first = new.split(b"\n")[0].decode(errors="replace") if new else "" - log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}") + log("command={} pathname={} old={} new={}".format( + cmd, pathname, old_first, new_first)) if mode == "error": write_flush() @@ -130,7 +132,7 @@ setup_backend () { else: ol = old.count(b"\n") nl = new.count(b"\n") - write_pkt(f"hunk 1 {ol} 1 {nl}") + write_pkt("hunk 1 {} 1 {}".format(ol, nl)) write_flush() write_pkt("status=success") write_flush() -- snap -- And this is still not enough to make it work with Python2, see https://github.com/dscho/git/actions/runs/27091523842/job/79955895737: -- snip -- [...] + git -c diff.cdiff.process=./diff-process-backend --mode=fixed-hunk diff boundary.c Traceback (most recent call last): File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 45, in <module> assert read_pkt() == "git-diff-client" File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 4, in read_pkt hdr = sys.stdin.buffer.read(4) AttributeError: 'file' object has no attribute 'buffer' -- snap -- I have experienced similar patterns in my career, where a single decision required multiple follow-up fixes _just_ to avoid having to revert that decision. This kind of doubling down has never ended well. Therefore I would like to take a step back, and ask: Is it _really_ a good idea to use Python here? Are we certain that we want to _require_ Python to run this test and skip it if Python isn't available (as is the case in the Windows-related parts of Git's very own CI) even if Python has nothing at all to do with the feature that is being tested? I don't want to be doomed to repeat history, and we can very well learn e.g. from prior art in this very project, where the tests for the clean/smudge filters (which _also_ want to speak pkt-line over stdio) needlessly incurred Perl as a requirement to run the tests. It was Matheus's heroic work in 52917a998ef3a (t0021: implementation the rot13-filter.pl script in C, 2022-08-14) and 4d1d843be7a15 (tests: use the new C rot13-filter helper to avoid PERL prereq, 2022-08-14) that avoided that unnecessary prerequisite. Likewise, there is `test-tool pkt-line` intended for driving the pkt-line protocol via simple shell scripts. So the conscious project direction has been: fold pkt-line test backends into `test-tool` and drop the scripting-language prereq. Reintroducing the same shape in Python would walk this back. Patrick's careful effort in 27bd8ee311719 (Merge branch 'ps/fewer-perl', 2025-04-29) has been another clear sign that the Git project is actively _removing_ scripting-language dependencies from the build and test infrastructure, not adding new ones. The clear prior art in Git's own tests for what t4080 wants to do, as of today, is `t/helper/test-rot13-filter.c`, which could be imitated here instead of (re-)introducing a dependency on a scripting language other than Unix shell in Git's test suite. The `PYTHON` prereq exists in exactly five files today, all `git p4` related (where Python is an intrinsic prerequisite given that `git-p4.py` _is_ written in Python): `t/lib-git-p4.sh`, `t/t9802-git-p4-filetype.sh`, `t/t9810-git-p4-rcs.sh`, `t/t9835-git-p4-metadata-encoding-python2.sh`, and `t/t9836-git-p4-metadata-encoding-python3.sh`. After 7cdbff14d482 (remove merge-recursive-old, 2006-11-20), this here patch would be the first one, after almost 20 years, to re-introduce Python as a dependency outside `git p4`. And it would also be the first ever to embed a Python script as a heredoc: > +fi > + > +# > +# A single parametric diff process. > +# Usage: diff-process-backend --mode=<mode> [--log=<path>] > +# > +# Modes: > +# whole-file - report all lines as changed (default) > +# fixed-hunk - always report hunk 5 2 5 2 > +# bad-hunk - report out-of-bounds hunk 999 1 999 1 > +# bad-sync - report hunk with mismatched unchanged totals > +# overlap - report two overlapping hunks > +# no-hunks - return no hunks (files considered equivalent) > +# error - return status=error for every request > +# abort - return status=abort for every request > +# crash - read one request then exit without responding > +# > +setup_backend () { > + cat >"$TRASH_DIRECTORY/diff-process-backend.py" <<-\PYEOF > + import sys, os > + > + def read_pkt(): > + hdr = sys.stdin.buffer.read(4) > + if len(hdr) < 4: return None > + length = int(hdr, 16) > + if length == 0: return "" > + data = sys.stdin.buffer.read(length - 4) > + return data.decode().rstrip("\n") > + > + def write_pkt(line): > + data = (line + "\n").encode() > + sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data) > + sys.stdout.buffer.flush() > + > + def write_flush(): > + sys.stdout.buffer.write(b"0000") > + sys.stdout.buffer.flush() > + > + def read_content(): > + chunks = [] > + while True: > + hdr = sys.stdin.buffer.read(4) > + if len(hdr) < 4: break > + length = int(hdr, 16) > + if length == 0: break > + chunks.append(sys.stdin.buffer.read(length - 4)) > + return b"".join(chunks) > + > + mode = "whole-file" > + logfile = None > + for arg in sys.argv[1:]: > + if arg.startswith("--mode="): > + mode = arg[7:] > + elif arg.startswith("--log="): > + logfile = open(arg[6:], "a") > + > + def log(msg): > + if logfile: > + logfile.write(msg + "\n") > + logfile.flush() > + > + # Handshake > + assert read_pkt() == "git-diff-client" > + assert read_pkt() == "version=1" > + read_pkt() > + write_pkt("git-diff-server") > + write_pkt("version=1") > + write_flush() > + while True: > + p = read_pkt() > + if p == "": break > + write_pkt("capability=hunks") > + write_flush() > + > + log("ready") > + > + while True: > + cmd = None > + pathname = None > + while True: > + p = read_pkt() > + if p is None: sys.exit(0) > + if p == "": break > + if p.startswith("command="): cmd = p.split("=",1)[1] > + if p.startswith("pathname="): pathname = p.split("=",1)[1] > + if cmd is None: sys.exit(0) > + old = read_content() > + new = read_content() > + old_first = old.split(b"\n")[0].decode(errors="replace") if old else "" > + new_first = new.split(b"\n")[0].decode(errors="replace") if new else "" > + log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}") > + > + if mode == "error": > + write_flush() > + write_pkt("status=error") > + write_flush() > + continue > + > + if mode == "abort": > + write_flush() > + write_pkt("status=abort") > + write_flush() > + continue > + > + if mode == "crash": > + sys.exit(1) > + > + if cmd == "hunks": > + if mode == "fixed-hunk": > + write_pkt("hunk 5 2 5 2") > + elif mode == "bad-hunk": > + write_pkt("hunk 999 1 999 1") > + elif mode == "bad-sync": > + write_pkt("hunk 1 2 1 1") > + elif mode == "overlap": > + write_pkt("hunk 1 5 1 5") > + write_pkt("hunk 3 2 3 2") > + elif mode == "no-hunks": > + pass > + else: > + ol = old.count(b"\n") > + nl = new.count(b"\n") > + write_pkt(f"hunk 1 {ol} 1 {nl}") > + write_flush() > + write_pkt("status=success") > + write_flush() > + else: > + write_flush() > + write_pkt("status=error") > + write_flush() > + PYEOF The existing pattern is to provide larger scripts as fixtures in associated `t/tNNNN/` directories, not as heredoc, see e.g. `t/t1509/prepare-chroot.sh`. Writing scripts, especially lengthy ones, in heredoc strings makes it virtually impossible to use static code analysis or syntax highlighting to fend off banal errors. Given the complexity of what t4080 tries to test (error, abort, crash, bad-sync, no-hunks, multiple files in one session, capability negotiation), it would unfortunately be infeasible to use `test-tool pkt-line` from a shell script implementing that `diff.*.process` protocol. So I've spiked a demo how the `test-tool diff-process-backend` could look like (letting Opus do the menial typing, so that I can enjoy at least part of a sunny Sunday outside), which also passes the CI build and test: https://github.com/dscho/git/commit/b6e3c93381b00929476c3a00155f7cf7334a22e6 That commit is of course not intended to be used as-is; Feel free to pick code parts of it and integrate them into your topic branch. Or write your own test-tool helper from scratch if that's more your jam. Ciao, Johannes > + write_script diff-process-backend <<-SHEOF > + exec "$PYTHON_PATH" "$TRASH_DIRECTORY/diff-process-backend.py" "\$@" > + SHEOF > +} > + > +BACKEND="./diff-process-backend" > + > +test_expect_success PYTHON 'setup' ' > + setup_backend && > + echo "*.c diff=cdiff" >.gitattributes && > + git add .gitattributes && > + > + # boundary.c: 10 lines, changes at 5-6 and 9-10. > + # Used by: hunk boundaries, error fallback, crash, bad hunks, overlap. > + cat >boundary.c <<-\EOF && > + line1 > + line2 > + line3 > + line4 > + OLD5 > + OLD6 > + line7 > + line8 > + OLD9 > + OLD10 > + EOF > + git add boundary.c && > + > + # worddiff.c: single-line function, value changes 1 -> 999. > + # Used by: word-diff, --diff-algorithm, --no-ext-diff, --stat. > + cat >worddiff.c <<-\EOF && > + int value(void) { return 1; } > + EOF > + git add worddiff.c && > + > + # newfile.c: single-line function, value changes 42 -> 99. > + # Used by: new file, --exit-code, multiple drivers. > + cat >newfile.c <<-\EOF && > + int new_func(void) { return 42; } > + EOF > + git add newfile.c && > + > + # logtest.c: single-line function for log/format-patch tests. > + # Needs two commits so log -1 has a diff. > + cat >logtest.c <<-\EOF && > + int logfunc(void) { return 1; } > + EOF > + git add logtest.c && > + > + # two.c/one.c: two-file pair for error/abort/startup-failure tests. > + cat >one.c <<-\EOF && > + int first(void) { return 1; } > + EOF > + cat >two.c <<-\EOF && > + int second(void) { return 2; } > + EOF > + git add one.c two.c && > + > + git commit -m "initial" && > + > + # Second commit for logtest.c (so log -1 has something to show). > + cat >logtest.c <<-\EOF && > + int logfunc(void) { return 2; } > + EOF > + git add logtest.c && > + git commit -m "change logtest.c" && > + > + # Working tree modifications (not committed). > + cat >boundary.c <<-\EOF && > + line1 > + line2 > + line3 > + line4 > + NEW5 > + NEW6 > + line7 > + line8 > + NEW9 > + NEW10 > + EOF > + > + cat >worddiff.c <<-\EOF && > + int value(void) { return 999; } > + EOF > + > + cat >newfile.c <<-\EOF && > + int new_func(void) { return 99; } > + EOF > + > + cat >one.c <<-\EOF && > + int first(void) { return 10; } > + EOF > + > + cat >two.c <<-\EOF > + int second(void) { return 20; } > + EOF > +' > + > +# > +# Core behavior: the tool controls which lines are marked as changed. > +# > + > +test_expect_success PYTHON 'diff process hunk boundaries affect output' ' > + # The file has changes at lines 5-6 and 9-10, but fixed-hunk > + # only reports lines 5-6 as changed. Lines 9-10 should not > + # appear as changed in the output. > + git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \ > + diff boundary.c >actual && > + test_grep "^-OLD5" actual && > + test_grep "^-OLD6" actual && > + test_grep "^+NEW5" actual && > + test_grep "^+NEW6" actual && > + test_grep ! "^-OLD9" actual && > + test_grep ! "^-OLD10" actual && > + test_grep ! "^+NEW9" actual && > + test_grep ! "^+NEW10" actual > +' > + > +test_expect_success PYTHON 'diff process works with new file' ' > + rm -f backend.log && > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > + diff -- newfile.c >actual 2>stderr && > + test_grep "return 99" actual && > + test_grep "pathname=newfile.c" backend.log && > + test_must_be_empty stderr > +' > + > +test_expect_success PYTHON 'diff process works with added file (empty old side)' ' > + cat >added.c <<-\EOF && > + int added(void) { return 1; } > + EOF > + git add added.c && > + > + rm -f backend.log && > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > + diff --cached -- added.c >actual 2>stderr && > + test_grep "added" actual && > + test_grep "pathname=added.c" backend.log && > + test_must_be_empty stderr > +' > + > +test_expect_success PYTHON 'diff process skipped for binary files' ' > + printf "\\0binary" >binary.c && > + git add binary.c && > + git commit -m "add binary" && > + printf "\\0changed" >binary.c && > + > + rm -f backend.log && > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > + diff -- binary.c >actual && > + test_grep "Binary files" actual && > + test_path_is_missing backend.log > +' > + > +test_expect_success PYTHON 'diff process not consulted for unmatched driver' ' > + echo "not tracked by cdiff" >unmatched.txt && > + git add unmatched.txt && > + git commit -m "add unmatched.txt" && > + > + echo "modified" >unmatched.txt && > + > + rm -f backend.log && > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > + diff -- unmatched.txt >actual && > + test_grep "modified" actual && > + test_path_is_missing backend.log > +' > + > +test_expect_success PYTHON 'multiple drivers use separate processes' ' > + echo "*.h diff=hdiff" >>.gitattributes && > + git add .gitattributes && > + > + cat >multi.h <<-\EOF && > + int header(void) { return 1; } > + EOF > + git add multi.h && > + git commit -m "add multi.h" && > + > + cat >multi.h <<-\EOF && > + int header(void) { return 2; } > + EOF > + > + rm -f backend-c.log backend-h.log && > + git -c diff.cdiff.process="$BACKEND --log=backend-c.log" \ > + -c diff.hdiff.process="$BACKEND --log=backend-h.log" \ > + diff -- newfile.c multi.h >actual 2>stderr && > + test_grep "pathname=newfile.c" backend-c.log && > + test_grep "pathname=multi.h" backend-h.log && > + test_must_be_empty stderr > +' > + > +test_expect_success PYTHON 'diff process works alongside textconv' ' > + write_script uppercase-filter <<-\EOF && > + tr "a-z" "A-Z" <"$1" > + EOF > + > + cat >textconv.c <<-\EOF && > + hello world > + EOF > + git add textconv.c && > + git commit -m "add textconv.c" && > + > + cat >textconv.c <<-\EOF && > + goodbye world > + EOF > + > + rm -f backend.log && > + git -c diff.cdiff.textconv="./uppercase-filter" \ > + -c diff.cdiff.process="$BACKEND --log=backend.log" \ > + diff -- textconv.c >actual 2>stderr && > + # The diff process receives textconv-transformed (uppercase) content. > + test_grep "pathname=textconv.c" backend.log && > + test_grep "old=HELLO WORLD" backend.log && > + test_grep "new=GOODBYE WORLD" backend.log && > + test_must_be_empty stderr > +' > + > +# > +# Downstream features: word diff, log, equivalent files, exit code. > +# > + > +test_expect_success PYTHON 'diff process with --word-diff' ' > + rm -f backend.log && > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > + diff --word-diff worddiff.c >actual 2>stderr && > + test_grep "\[-1;-\]" actual && > + test_grep "{+999;+}" actual && > + test_grep "pathname=worddiff.c" backend.log && > + test_must_be_empty stderr > +' > + > +test_expect_success PYTHON 'diff process works with git log -p' ' > + # With no-hunks mode, the tool says the files are equivalent, > + # so log -p should show the commit but no diff content. > + rm -f backend.log && > + git -c diff.cdiff.process="$BACKEND --mode=no-hunks --log=backend.log" \ > + log -1 -p -- logtest.c >actual 2>stderr && > + test_grep "change logtest.c" actual && > + test_grep ! "return 2" actual && > + test_grep "command=hunks pathname=logtest.c" backend.log && > + test_must_be_empty stderr > +' > + > +test_expect_success PYTHON 'diff process no hunks suppresses diff output' ' > + cat >nohunks.c <<-\EOF && > + int zero(void) { return 0; } > + EOF > + git add nohunks.c && > + git commit -m "add nohunks.c" && > + > + cat >nohunks.c <<-\EOF && > + int zero(void) { return 999; } > + EOF > + > + git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ > + diff nohunks.c >actual && > + test_must_be_empty actual > +' > + > +test_expect_success PYTHON 'diff process no hunks with --exit-code returns success' ' > + git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ > + diff --exit-code nohunks.c > +' > + > +test_expect_success PYTHON 'diff process with --exit-code and hunks returns failure' ' > + test_expect_code 1 git -c diff.cdiff.process="$BACKEND" \ > + diff --exit-code newfile.c > +' > + > +# > +# Bypass mechanisms: flags and commands that skip the diff process. > +# > + > +test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' ' > + rm -f backend.log && > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > + diff --diff-algorithm=patience worddiff.c >actual && > + test_grep "return 999" actual && > + test_path_is_missing backend.log > +' > + > +test_expect_success PYTHON 'diff process not used by --stat' ' > + rm -f backend.log && > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > + diff --stat worddiff.c >actual && > + test_grep "worddiff.c" actual && > + test_path_is_missing backend.log > +' > + > +# > +# Error handling and fallback. > +# > + > +test_expect_success PYTHON 'diff process fallback on tool error status' ' > + rm -f backend.log && > + git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ > + diff boundary.c >actual 2>stderr && > + # Fallback produces the full builtin diff (both change regions). > + test_grep "^-OLD5" actual && > + test_grep "^+NEW5" actual && > + test_grep "^-OLD9" actual && > + test_grep "^+NEW9" actual && > + # Tool was contacted (it replied with error, not crash). > + test_grep "command=hunks pathname=boundary.c" backend.log && > + test_grep "diff process.*failed" stderr > +' > + > +test_expect_success PYTHON 'diff process error keeps tool available for next file' ' > + rm -f backend.log && > + git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ > + diff -- one.c two.c >actual 2>stderr && > + # Unlike abort, error keeps the tool available: both files > + # are sent to the tool (and both fall back). > + test_grep "pathname=one.c" backend.log && > + test_grep "pathname=two.c" backend.log && > + test_grep "return 10" actual && > + test_grep "return 20" actual > +' > + > +test_expect_success PYTHON 'diff process abort disables for session' ' > + rm -f backend.log && > + git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \ > + diff -- one.c two.c >actual && > + # Both files should still produce diff output via fallback. > + test_grep "return 10" actual && > + test_grep "return 20" actual && > + # The tool aborts on the first file and git clears its > + # capability. The second file never contacts the tool. > + test_grep "pathname=one.c" backend.log && > + test_grep ! "pathname=two.c" backend.log > +' > + > +test_expect_success PYTHON 'diff process fallback on tool crash' ' > + git -c diff.cdiff.process="$BACKEND --mode=crash" \ > + diff boundary.c >actual 2>stderr && > + test_grep "^-OLD5" actual && > + test_grep "^+NEW5" actual && > + test_grep "^-OLD9" actual && > + test_grep "^+NEW9" actual && > + # Crash is a communication failure, so a warning is emitted. > + test_grep "diff process.*failed" stderr > +' > + > +test_expect_success PYTHON 'diff process startup failure only warns once' ' > + git -c diff.cdiff.process="/nonexistent/tool" \ > + diff -- one.c two.c >actual 2>stderr && > + # Both files produce diff output via fallback. > + test_grep "return 10" actual && > + test_grep "return 20" actual && > + # Sentinel prevents repeated warnings: only one, not one per file. > + test_grep "diff process.*failed" stderr >warnings && > + test_line_count = 1 warnings > +' > + > +test_expect_success PYTHON 'diff process fallback on bad hunks' ' > + git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \ > + diff boundary.c >actual 2>stderr && > + test_grep "^-OLD5" actual && > + test_grep "^+NEW5" actual && > + test_grep "^-OLD9" actual && > + test_grep "^+NEW9" actual && > + # Invalid hunks are caught by xdiff validation, not the > + # protocol layer, so no warning is emitted. > + test_must_be_empty stderr > +' > + > +test_expect_success PYTHON 'diff process fallback on mismatched unchanged totals' ' > + cat >synctest.c <<-\EOF && > + line1 > + line2 > + line3 > + EOF > + git add synctest.c && > + git commit -m "add synctest.c" && > + > + cat >synctest.c <<-\EOF && > + line1 > + changed > + line3 > + EOF > + > + # bad-sync reports hunk 1 2 1 1: marks 2 old lines and 1 new > + # line as changed, leaving 1 unchanged old vs 2 unchanged new. > + # The synchronization invariant fails and git falls back. > + git -c diff.cdiff.process="$BACKEND --mode=bad-sync" \ > + diff synctest.c >actual 2>stderr && > + test_grep "changed" actual > +' > + > +test_expect_success PYTHON 'diff process fallback on overlapping hunks' ' > + # boundary.c has 10 lines, so both hunks are in bounds > + # but they overlap at lines 3-5, triggering the ordering check. > + git -c diff.cdiff.process="$BACKEND --mode=overlap" \ > + diff boundary.c >actual 2>stderr && > + test_grep "NEW5" actual > +' > + > +test_done > diff --git a/userdiff.h b/userdiff.h > index 51c26e0d41..a98eabe377 100644 > --- a/userdiff.h > +++ b/userdiff.h > @@ -3,6 +3,7 @@ > > #include "notes-cache.h" > > +struct diff_subprocess; > struct index_state; > struct repository; > > @@ -33,6 +34,8 @@ struct userdiff_driver { > int textconv_want_cache; > const char *process; > char *process_owned; > + struct diff_subprocess *diff_subprocess; > + unsigned diff_process_failed : 1; > }; > enum userdiff_driver_type { > USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0, > -- > gitgitgadget > > >

Michael Montalbo wrote on the Git mailing list (how to reply to this email):

On Sun, Jun 7, 2026 at 7:36 AM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Hi Michael, > > I stumbled about this patch when it broke CI in Git for Windows, where we > do _not_ use `NO_PYTHON`, even though Python is unavailable in the > build/test CI jobs. The existing tests handle this situation gracefully, > this here patch does not: > > On Sun, 7 Jun 2026, Michael Montalbo via GitGitGadget wrote: > > > diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh > > new file mode 100755 > > index 0000000000..f159cd86d8 > > --- /dev/null > > +++ b/t/t4080-diff-process.sh > > @@ -0,0 +1,538 @@ > > +#!/bin/sh > > + > > +test_description='diff process via long-running process' > > + > > +. ./test-lib.sh > > + > > +if test_have_prereq PYTHON > > +then > > + PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python) > > When neither `python3` nor `python` are available (which is the case in > the minimal Git for Windows SDK used in Git's CI runs), this fails under > `set -e`. Before even running the first test case. Resulting in an > unexpected TAP format error. > > Now, we could "fix" this by imitating what `lib-p4` does (see > https://github.com/dscho/git/commit/bd0b5570c744f678911a67a62da63f30655f20d8 > which demonstrates that it is indeed a work-around on Windows): > > -- snip -- > diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh > index fdf6da1c341e67..bd22c247ff3856 100755 > --- a/t/t4080-diff-process.sh > +++ b/t/t4080-diff-process.sh > @@ -4,9 +4,10 @@ test_description='diff process via long-running process' > > . ./test-lib.sh > > -if test_have_prereq PYTHON > +if ! test_have_prereq PYTHON || ! test -x "$PYTHON_PATH" > then > - PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python) > + skip_all='python interpreter not available' > + test_done > fi > > # > -- snap -- > > Of course, this uncovers _another_ problem with the Python script: It uses > Python3-only `f"..."` format strings, which cannot be handled by the > Python2 to which the `PYTHON_PATH` variable in `linux-TEST-vars` points. > So this requires _another follow-up (see also > https://github.com/dscho/git/commit/c12a9f4c80e5ce8db0fe370fac46fb45be2b775f): > > -- snip -- > diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh > index bd22c247ff3856..ba14682a9086e4 100755 > --- a/t/t4080-diff-process.sh > +++ b/t/t4080-diff-process.sh > @@ -39,7 +39,8 @@ setup_backend () { > > def write_pkt(line): > data = (line + "\n").encode() > - sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data) > + hdr = "{:04x}".format(len(data) + 4).encode() > + sys.stdout.buffer.write(hdr + data) > sys.stdout.buffer.flush() > > def write_flush(): > @@ -98,7 +99,8 @@ setup_backend () { > new = read_content() > old_first = old.split(b"\n")[0].decode(errors="replace") if old else "" > new_first = new.split(b"\n")[0].decode(errors="replace") if new else "" > - log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}") > + log("command={} pathname={} old={} new={}".format( > + cmd, pathname, old_first, new_first)) > > if mode == "error": > write_flush() > @@ -130,7 +132,7 @@ setup_backend () { > else: > ol = old.count(b"\n") > nl = new.count(b"\n") > - write_pkt(f"hunk 1 {ol} 1 {nl}") > + write_pkt("hunk 1 {} 1 {}".format(ol, nl)) > write_flush() > write_pkt("status=success") > write_flush() > -- snap -- > > And this is still not enough to make it work with Python2, see > https://github.com/dscho/git/actions/runs/27091523842/job/79955895737: > > -- snip -- > [...] > + git -c diff.cdiff.process=./diff-process-backend --mode=fixed-hunk diff boundary.c > Traceback (most recent call last): > File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 45, in <module> > assert read_pkt() == "git-diff-client" > File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 4, in read_pkt > hdr = sys.stdin.buffer.read(4) > AttributeError: 'file' object has no attribute 'buffer' > -- snap -- > > I have experienced similar patterns in my career, where a single decision > required multiple follow-up fixes _just_ to avoid having to revert that > decision. This kind of doubling down has never ended well. > > Therefore I would like to take a step back, and ask: Is it _really_ a good > idea to use Python here? Are we certain that we want to _require_ Python > to run this test and skip it if Python isn't available (as is the case in > the Windows-related parts of Git's very own CI) even if Python has nothing > at all to do with the feature that is being tested? > > I don't want to be doomed to repeat history, and we can very well learn > e.g. from prior art in this very project, where the tests for the > clean/smudge filters (which _also_ want to speak pkt-line over stdio) > needlessly incurred Perl as a requirement to run the tests. It was > Matheus's heroic work in 52917a998ef3a (t0021: implementation the > rot13-filter.pl script in C, 2022-08-14) and 4d1d843be7a15 (tests: use the > new C rot13-filter helper to avoid PERL prereq, 2022-08-14) that avoided > that unnecessary prerequisite. > > Likewise, there is `test-tool pkt-line` intended for driving the pkt-line > protocol via simple shell scripts. > > So the conscious project direction has been: fold pkt-line test backends > into `test-tool` and drop the scripting-language prereq. Reintroducing the > same shape in Python would walk this back. > > Patrick's careful effort in 27bd8ee311719 (Merge branch 'ps/fewer-perl', > 2025-04-29) has been another clear sign that the Git project is actively > _removing_ scripting-language dependencies from the build and test > infrastructure, not adding new ones. > > The clear prior art in Git's own tests for what t4080 wants to do, as of > today, is `t/helper/test-rot13-filter.c`, which could be imitated here > instead of (re-)introducing a dependency on a scripting language other > than Unix shell in Git's test suite. > > The `PYTHON` prereq exists in exactly five files today, all `git p4` > related (where Python is an intrinsic prerequisite given that `git-p4.py` > _is_ written in Python): `t/lib-git-p4.sh`, `t/t9802-git-p4-filetype.sh`, > `t/t9810-git-p4-rcs.sh`, `t/t9835-git-p4-metadata-encoding-python2.sh`, > and `t/t9836-git-p4-metadata-encoding-python3.sh`. > > After 7cdbff14d482 (remove merge-recursive-old, 2006-11-20), this here > patch would be the first one, after almost 20 years, to re-introduce > Python as a dependency outside `git p4`. > > And it would also be the first ever to embed a Python script as a heredoc: > > > +fi > > + > > +# > > +# A single parametric diff process. > > +# Usage: diff-process-backend --mode=<mode> [--log=<path>] > > +# > > +# Modes: > > +# whole-file - report all lines as changed (default) > > +# fixed-hunk - always report hunk 5 2 5 2 > > +# bad-hunk - report out-of-bounds hunk 999 1 999 1 > > +# bad-sync - report hunk with mismatched unchanged totals > > +# overlap - report two overlapping hunks > > +# no-hunks - return no hunks (files considered equivalent) > > +# error - return status=error for every request > > +# abort - return status=abort for every request > > +# crash - read one request then exit without responding > > +# > > +setup_backend () { > > + cat >"$TRASH_DIRECTORY/diff-process-backend.py" <<-\PYEOF > > + import sys, os > > + > > + def read_pkt(): > > + hdr = sys.stdin.buffer.read(4) > > + if len(hdr) < 4: return None > > + length = int(hdr, 16) > > + if length == 0: return "" > > + data = sys.stdin.buffer.read(length - 4) > > + return data.decode().rstrip("\n") > > + > > + def write_pkt(line): > > + data = (line + "\n").encode() > > + sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data) > > + sys.stdout.buffer.flush() > > + > > + def write_flush(): > > + sys.stdout.buffer.write(b"0000") > > + sys.stdout.buffer.flush() > > + > > + def read_content(): > > + chunks = [] > > + while True: > > + hdr = sys.stdin.buffer.read(4) > > + if len(hdr) < 4: break > > + length = int(hdr, 16) > > + if length == 0: break > > + chunks.append(sys.stdin.buffer.read(length - 4)) > > + return b"".join(chunks) > > + > > + mode = "whole-file" > > + logfile = None > > + for arg in sys.argv[1:]: > > + if arg.startswith("--mode="): > > + mode = arg[7:] > > + elif arg.startswith("--log="): > > + logfile = open(arg[6:], "a") > > + > > + def log(msg): > > + if logfile: > > + logfile.write(msg + "\n") > > + logfile.flush() > > + > > + # Handshake > > + assert read_pkt() == "git-diff-client" > > + assert read_pkt() == "version=1" > > + read_pkt() > > + write_pkt("git-diff-server") > > + write_pkt("version=1") > > + write_flush() > > + while True: > > + p = read_pkt() > > + if p == "": break > > + write_pkt("capability=hunks") > > + write_flush() > > + > > + log("ready") > > + > > + while True: > > + cmd = None > > + pathname = None > > + while True: > > + p = read_pkt() > > + if p is None: sys.exit(0) > > + if p == "": break > > + if p.startswith("command="): cmd = p.split("=",1)[1] > > + if p.startswith("pathname="): pathname = p.split("=",1)[1] > > + if cmd is None: sys.exit(0) > > + old = read_content() > > + new = read_content() > > + old_first = old.split(b"\n")[0].decode(errors="replace") if old else "" > > + new_first = new.split(b"\n")[0].decode(errors="replace") if new else "" > > + log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}") > > + > > + if mode == "error": > > + write_flush() > > + write_pkt("status=error") > > + write_flush() > > + continue > > + > > + if mode == "abort": > > + write_flush() > > + write_pkt("status=abort") > > + write_flush() > > + continue > > + > > + if mode == "crash": > > + sys.exit(1) > > + > > + if cmd == "hunks": > > + if mode == "fixed-hunk": > > + write_pkt("hunk 5 2 5 2") > > + elif mode == "bad-hunk": > > + write_pkt("hunk 999 1 999 1") > > + elif mode == "bad-sync": > > + write_pkt("hunk 1 2 1 1") > > + elif mode == "overlap": > > + write_pkt("hunk 1 5 1 5") > > + write_pkt("hunk 3 2 3 2") > > + elif mode == "no-hunks": > > + pass > > + else: > > + ol = old.count(b"\n") > > + nl = new.count(b"\n") > > + write_pkt(f"hunk 1 {ol} 1 {nl}") > > + write_flush() > > + write_pkt("status=success") > > + write_flush() > > + else: > > + write_flush() > > + write_pkt("status=error") > > + write_flush() > > + PYEOF > > The existing pattern is to provide larger scripts as fixtures in > associated `t/tNNNN/` directories, not as heredoc, see e.g. > `t/t1509/prepare-chroot.sh`. Writing scripts, especially lengthy ones, in > heredoc strings makes it virtually impossible to use static code analysis > or syntax highlighting to fend off banal errors. > > Given the complexity of what t4080 tries to test (error, abort, crash, > bad-sync, no-hunks, multiple files in one session, capability > negotiation), it would unfortunately be infeasible to use `test-tool > pkt-line` from a shell script implementing that `diff.*.process` protocol. > > So I've spiked a demo how the `test-tool diff-process-backend` could look > like (letting Opus do the menial typing, so that I can enjoy at least part > of a sunny Sunday outside), which also passes the CI build and test: > https://github.com/dscho/git/commit/b6e3c93381b00929476c3a00155f7cf7334a22e6 > > That commit is of course not intended to be used as-is; Feel free to pick > code parts of it and integrate them into your topic branch. Or write your > own test-tool helper from scratch if that's more your jam. > Johannes, thank you for the great feedback. The historical context is really helpful and the concerns you raise make a lot of sense. I will take a look at your spike and also work on removing Python from the test. > Ciao, > Johannes > > > + write_script diff-process-backend <<-SHEOF > > + exec "$PYTHON_PATH" "$TRASH_DIRECTORY/diff-process-backend.py" "\$@" > > + SHEOF > > +} > > + > > +BACKEND="./diff-process-backend" > > + > > +test_expect_success PYTHON 'setup' ' > > + setup_backend && > > + echo "*.c diff=cdiff" >.gitattributes && > > + git add .gitattributes && > > + > > + # boundary.c: 10 lines, changes at 5-6 and 9-10. > > + # Used by: hunk boundaries, error fallback, crash, bad hunks, overlap. > > + cat >boundary.c <<-\EOF && > > + line1 > > + line2 > > + line3 > > + line4 > > + OLD5 > > + OLD6 > > + line7 > > + line8 > > + OLD9 > > + OLD10 > > + EOF > > + git add boundary.c && > > + > > + # worddiff.c: single-line function, value changes 1 -> 999. > > + # Used by: word-diff, --diff-algorithm, --no-ext-diff, --stat. > > + cat >worddiff.c <<-\EOF && > > + int value(void) { return 1; } > > + EOF > > + git add worddiff.c && > > + > > + # newfile.c: single-line function, value changes 42 -> 99. > > + # Used by: new file, --exit-code, multiple drivers. > > + cat >newfile.c <<-\EOF && > > + int new_func(void) { return 42; } > > + EOF > > + git add newfile.c && > > + > > + # logtest.c: single-line function for log/format-patch tests. > > + # Needs two commits so log -1 has a diff. > > + cat >logtest.c <<-\EOF && > > + int logfunc(void) { return 1; } > > + EOF > > + git add logtest.c && > > + > > + # two.c/one.c: two-file pair for error/abort/startup-failure tests. > > + cat >one.c <<-\EOF && > > + int first(void) { return 1; } > > + EOF > > + cat >two.c <<-\EOF && > > + int second(void) { return 2; } > > + EOF > > + git add one.c two.c && > > + > > + git commit -m "initial" && > > + > > + # Second commit for logtest.c (so log -1 has something to show). > > + cat >logtest.c <<-\EOF && > > + int logfunc(void) { return 2; } > > + EOF > > + git add logtest.c && > > + git commit -m "change logtest.c" && > > + > > + # Working tree modifications (not committed). > > + cat >boundary.c <<-\EOF && > > + line1 > > + line2 > > + line3 > > + line4 > > + NEW5 > > + NEW6 > > + line7 > > + line8 > > + NEW9 > > + NEW10 > > + EOF > > + > > + cat >worddiff.c <<-\EOF && > > + int value(void) { return 999; } > > + EOF > > + > > + cat >newfile.c <<-\EOF && > > + int new_func(void) { return 99; } > > + EOF > > + > > + cat >one.c <<-\EOF && > > + int first(void) { return 10; } > > + EOF > > + > > + cat >two.c <<-\EOF > > + int second(void) { return 20; } > > + EOF > > +' > > + > > +# > > +# Core behavior: the tool controls which lines are marked as changed. > > +# > > + > > +test_expect_success PYTHON 'diff process hunk boundaries affect output' ' > > + # The file has changes at lines 5-6 and 9-10, but fixed-hunk > > + # only reports lines 5-6 as changed. Lines 9-10 should not > > + # appear as changed in the output. > > + git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \ > > + diff boundary.c >actual && > > + test_grep "^-OLD5" actual && > > + test_grep "^-OLD6" actual && > > + test_grep "^+NEW5" actual && > > + test_grep "^+NEW6" actual && > > + test_grep ! "^-OLD9" actual && > > + test_grep ! "^-OLD10" actual && > > + test_grep ! "^+NEW9" actual && > > + test_grep ! "^+NEW10" actual > > +' > > + > > +test_expect_success PYTHON 'diff process works with new file' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff -- newfile.c >actual 2>stderr && > > + test_grep "return 99" actual && > > + test_grep "pathname=newfile.c" backend.log && > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process works with added file (empty old side)' ' > > + cat >added.c <<-\EOF && > > + int added(void) { return 1; } > > + EOF > > + git add added.c && > > + > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff --cached -- added.c >actual 2>stderr && > > + test_grep "added" actual && > > + test_grep "pathname=added.c" backend.log && > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process skipped for binary files' ' > > + printf "\\0binary" >binary.c && > > + git add binary.c && > > + git commit -m "add binary" && > > + printf "\\0changed" >binary.c && > > + > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff -- binary.c >actual && > > + test_grep "Binary files" actual && > > + test_path_is_missing backend.log > > +' > > + > > +test_expect_success PYTHON 'diff process not consulted for unmatched driver' ' > > + echo "not tracked by cdiff" >unmatched.txt && > > + git add unmatched.txt && > > + git commit -m "add unmatched.txt" && > > + > > + echo "modified" >unmatched.txt && > > + > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff -- unmatched.txt >actual && > > + test_grep "modified" actual && > > + test_path_is_missing backend.log > > +' > > + > > +test_expect_success PYTHON 'multiple drivers use separate processes' ' > > + echo "*.h diff=hdiff" >>.gitattributes && > > + git add .gitattributes && > > + > > + cat >multi.h <<-\EOF && > > + int header(void) { return 1; } > > + EOF > > + git add multi.h && > > + git commit -m "add multi.h" && > > + > > + cat >multi.h <<-\EOF && > > + int header(void) { return 2; } > > + EOF > > + > > + rm -f backend-c.log backend-h.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend-c.log" \ > > + -c diff.hdiff.process="$BACKEND --log=backend-h.log" \ > > + diff -- newfile.c multi.h >actual 2>stderr && > > + test_grep "pathname=newfile.c" backend-c.log && > > + test_grep "pathname=multi.h" backend-h.log && > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process works alongside textconv' ' > > + write_script uppercase-filter <<-\EOF && > > + tr "a-z" "A-Z" <"$1" > > + EOF > > + > > + cat >textconv.c <<-\EOF && > > + hello world > > + EOF > > + git add textconv.c && > > + git commit -m "add textconv.c" && > > + > > + cat >textconv.c <<-\EOF && > > + goodbye world > > + EOF > > + > > + rm -f backend.log && > > + git -c diff.cdiff.textconv="./uppercase-filter" \ > > + -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff -- textconv.c >actual 2>stderr && > > + # The diff process receives textconv-transformed (uppercase) content. > > + test_grep "pathname=textconv.c" backend.log && > > + test_grep "old=HELLO WORLD" backend.log && > > + test_grep "new=GOODBYE WORLD" backend.log && > > + test_must_be_empty stderr > > +' > > + > > +# > > +# Downstream features: word diff, log, equivalent files, exit code. > > +# > > + > > +test_expect_success PYTHON 'diff process with --word-diff' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff --word-diff worddiff.c >actual 2>stderr && > > + test_grep "\[-1;-\]" actual && > > + test_grep "{+999;+}" actual && > > + test_grep "pathname=worddiff.c" backend.log && > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process works with git log -p' ' > > + # With no-hunks mode, the tool says the files are equivalent, > > + # so log -p should show the commit but no diff content. > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --mode=no-hunks --log=backend.log" \ > > + log -1 -p -- logtest.c >actual 2>stderr && > > + test_grep "change logtest.c" actual && > > + test_grep ! "return 2" actual && > > + test_grep "command=hunks pathname=logtest.c" backend.log && > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process no hunks suppresses diff output' ' > > + cat >nohunks.c <<-\EOF && > > + int zero(void) { return 0; } > > + EOF > > + git add nohunks.c && > > + git commit -m "add nohunks.c" && > > + > > + cat >nohunks.c <<-\EOF && > > + int zero(void) { return 999; } > > + EOF > > + > > + git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ > > + diff nohunks.c >actual && > > + test_must_be_empty actual > > +' > > + > > +test_expect_success PYTHON 'diff process no hunks with --exit-code returns success' ' > > + git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ > > + diff --exit-code nohunks.c > > +' > > + > > +test_expect_success PYTHON 'diff process with --exit-code and hunks returns failure' ' > > + test_expect_code 1 git -c diff.cdiff.process="$BACKEND" \ > > + diff --exit-code newfile.c > > +' > > + > > +# > > +# Bypass mechanisms: flags and commands that skip the diff process. > > +# > > + > > +test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff --diff-algorithm=patience worddiff.c >actual && > > + test_grep "return 999" actual && > > + test_path_is_missing backend.log > > +' > > + > > +test_expect_success PYTHON 'diff process not used by --stat' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff --stat worddiff.c >actual && > > + test_grep "worddiff.c" actual && > > + test_path_is_missing backend.log > > +' > > + > > +# > > +# Error handling and fallback. > > +# > > + > > +test_expect_success PYTHON 'diff process fallback on tool error status' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ > > + diff boundary.c >actual 2>stderr && > > + # Fallback produces the full builtin diff (both change regions). > > + test_grep "^-OLD5" actual && > > + test_grep "^+NEW5" actual && > > + test_grep "^-OLD9" actual && > > + test_grep "^+NEW9" actual && > > + # Tool was contacted (it replied with error, not crash). > > + test_grep "command=hunks pathname=boundary.c" backend.log && > > + test_grep "diff process.*failed" stderr > > +' > > + > > +test_expect_success PYTHON 'diff process error keeps tool available for next file' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ > > + diff -- one.c two.c >actual 2>stderr && > > + # Unlike abort, error keeps the tool available: both files > > + # are sent to the tool (and both fall back). > > + test_grep "pathname=one.c" backend.log && > > + test_grep "pathname=two.c" backend.log && > > + test_grep "return 10" actual && > > + test_grep "return 20" actual > > +' > > + > > +test_expect_success PYTHON 'diff process abort disables for session' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \ > > + diff -- one.c two.c >actual && > > + # Both files should still produce diff output via fallback. > > + test_grep "return 10" actual && > > + test_grep "return 20" actual && > > + # The tool aborts on the first file and git clears its > > + # capability. The second file never contacts the tool. > > + test_grep "pathname=one.c" backend.log && > > + test_grep ! "pathname=two.c" backend.log > > +' > > + > > +test_expect_success PYTHON 'diff process fallback on tool crash' ' > > + git -c diff.cdiff.process="$BACKEND --mode=crash" \ > > + diff boundary.c >actual 2>stderr && > > + test_grep "^-OLD5" actual && > > + test_grep "^+NEW5" actual && > > + test_grep "^-OLD9" actual && > > + test_grep "^+NEW9" actual && > > + # Crash is a communication failure, so a warning is emitted. > > + test_grep "diff process.*failed" stderr > > +' > > + > > +test_expect_success PYTHON 'diff process startup failure only warns once' ' > > + git -c diff.cdiff.process="/nonexistent/tool" \ > > + diff -- one.c two.c >actual 2>stderr && > > + # Both files produce diff output via fallback. > > + test_grep "return 10" actual && > > + test_grep "return 20" actual && > > + # Sentinel prevents repeated warnings: only one, not one per file. > > + test_grep "diff process.*failed" stderr >warnings && > > + test_line_count = 1 warnings > > +' > > + > > +test_expect_success PYTHON 'diff process fallback on bad hunks' ' > > + git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \ > > + diff boundary.c >actual 2>stderr && > > + test_grep "^-OLD5" actual && > > + test_grep "^+NEW5" actual && > > + test_grep "^-OLD9" actual && > > + test_grep "^+NEW9" actual && > > + # Invalid hunks are caught by xdiff validation, not the > > + # protocol layer, so no warning is emitted. > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process fallback on mismatched unchanged totals' ' > > + cat >synctest.c <<-\EOF && > > + line1 > > + line2 > > + line3 > > + EOF > > + git add synctest.c && > > + git commit -m "add synctest.c" && > > + > > + cat >synctest.c <<-\EOF && > > + line1 > > + changed > > + line3 > > + EOF > > + > > + # bad-sync reports hunk 1 2 1 1: marks 2 old lines and 1 new > > + # line as changed, leaving 1 unchanged old vs 2 unchanged new. > > + # The synchronization invariant fails and git falls back. > > + git -c diff.cdiff.process="$BACKEND --mode=bad-sync" \ > > + diff synctest.c >actual 2>stderr && > > + test_grep "changed" actual > > +' > > + > > +test_expect_success PYTHON 'diff process fallback on overlapping hunks' ' > > + # boundary.c has 10 lines, so both hunks are in bounds > > + # but they overlap at lines 3-5, triggering the ordering check. > > + git -c diff.cdiff.process="$BACKEND --mode=overlap" \ > > + diff boundary.c >actual 2>stderr && > > + test_grep "NEW5" actual > > +' > > + > > +test_done > > diff --git a/userdiff.h b/userdiff.h > > index 51c26e0d41..a98eabe377 100644 > > --- a/userdiff.h > > +++ b/userdiff.h > > @@ -3,6 +3,7 @@ > > > > #include "notes-cache.h" > > > > +struct diff_subprocess; > > struct index_state; > > struct repository; > > > > @@ -33,6 +34,8 @@ struct userdiff_driver { > > int textconv_want_cache; > > const char *process; > > char *process_owned; > > + struct diff_subprocess *diff_subprocess; > > + unsigned diff_process_failed : 1; > > }; > > enum userdiff_driver_type { > > USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0, > > -- > > gitgitgadget > > > > > >

Michael Montalbo wrote on the Git mailing list (how to reply to this email):

On Sun, Jun 7, 2026 at 7:36 AM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Hi Michael, > > I stumbled about this patch when it broke CI in Git for Windows, where we > do _not_ use `NO_PYTHON`, even though Python is unavailable in the > build/test CI jobs. The existing tests handle this situation gracefully, > this here patch does not: > > On Sun, 7 Jun 2026, Michael Montalbo via GitGitGadget wrote: > > > diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh > > new file mode 100755 > > index 0000000000..f159cd86d8 > > --- /dev/null > > +++ b/t/t4080-diff-process.sh > > @@ -0,0 +1,538 @@ > > +#!/bin/sh > > + > > +test_description='diff process via long-running process' > > + > > +. ./test-lib.sh > > + > > +if test_have_prereq PYTHON > > +then > > + PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python) > > When neither `python3` nor `python` are available (which is the case in > the minimal Git for Windows SDK used in Git's CI runs), this fails under > `set -e`. Before even running the first test case. Resulting in an > unexpected TAP format error. > > Now, we could "fix" this by imitating what `lib-p4` does (see > https://github.com/dscho/git/commit/bd0b5570c744f678911a67a62da63f30655f20d8 > which demonstrates that it is indeed a work-around on Windows): > > -- snip -- > diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh > index fdf6da1c341e67..bd22c247ff3856 100755 > --- a/t/t4080-diff-process.sh > +++ b/t/t4080-diff-process.sh > @@ -4,9 +4,10 @@ test_description='diff process via long-running process' > > . ./test-lib.sh > > -if test_have_prereq PYTHON > +if ! test_have_prereq PYTHON || ! test -x "$PYTHON_PATH" > then > - PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python) > + skip_all='python interpreter not available' > + test_done > fi > > # > -- snap -- > > Of course, this uncovers _another_ problem with the Python script: It uses > Python3-only `f"..."` format strings, which cannot be handled by the > Python2 to which the `PYTHON_PATH` variable in `linux-TEST-vars` points. > So this requires _another follow-up (see also > https://github.com/dscho/git/commit/c12a9f4c80e5ce8db0fe370fac46fb45be2b775f): > > -- snip -- > diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh > index bd22c247ff3856..ba14682a9086e4 100755 > --- a/t/t4080-diff-process.sh > +++ b/t/t4080-diff-process.sh > @@ -39,7 +39,8 @@ setup_backend () { > > def write_pkt(line): > data = (line + "\n").encode() > - sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data) > + hdr = "{:04x}".format(len(data) + 4).encode() > + sys.stdout.buffer.write(hdr + data) > sys.stdout.buffer.flush() > > def write_flush(): > @@ -98,7 +99,8 @@ setup_backend () { > new = read_content() > old_first = old.split(b"\n")[0].decode(errors="replace") if old else "" > new_first = new.split(b"\n")[0].decode(errors="replace") if new else "" > - log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}") > + log("command={} pathname={} old={} new={}".format( > + cmd, pathname, old_first, new_first)) > > if mode == "error": > write_flush() > @@ -130,7 +132,7 @@ setup_backend () { > else: > ol = old.count(b"\n") > nl = new.count(b"\n") > - write_pkt(f"hunk 1 {ol} 1 {nl}") > + write_pkt("hunk 1 {} 1 {}".format(ol, nl)) > write_flush() > write_pkt("status=success") > write_flush() > -- snap -- > > And this is still not enough to make it work with Python2, see > https://github.com/dscho/git/actions/runs/27091523842/job/79955895737: > > -- snip -- > [...] > + git -c diff.cdiff.process=./diff-process-backend --mode=fixed-hunk diff boundary.c > Traceback (most recent call last): > File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 45, in <module> > assert read_pkt() == "git-diff-client" > File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 4, in read_pkt > hdr = sys.stdin.buffer.read(4) > AttributeError: 'file' object has no attribute 'buffer' > -- snap -- > > I have experienced similar patterns in my career, where a single decision > required multiple follow-up fixes _just_ to avoid having to revert that > decision. This kind of doubling down has never ended well. > > Therefore I would like to take a step back, and ask: Is it _really_ a good > idea to use Python here? Are we certain that we want to _require_ Python > to run this test and skip it if Python isn't available (as is the case in > the Windows-related parts of Git's very own CI) even if Python has nothing > at all to do with the feature that is being tested? > > I don't want to be doomed to repeat history, and we can very well learn > e.g. from prior art in this very project, where the tests for the > clean/smudge filters (which _also_ want to speak pkt-line over stdio) > needlessly incurred Perl as a requirement to run the tests. It was > Matheus's heroic work in 52917a998ef3a (t0021: implementation the > rot13-filter.pl script in C, 2022-08-14) and 4d1d843be7a15 (tests: use the > new C rot13-filter helper to avoid PERL prereq, 2022-08-14) that avoided > that unnecessary prerequisite. > > Likewise, there is `test-tool pkt-line` intended for driving the pkt-line > protocol via simple shell scripts. > > So the conscious project direction has been: fold pkt-line test backends > into `test-tool` and drop the scripting-language prereq. Reintroducing the > same shape in Python would walk this back. > > Patrick's careful effort in 27bd8ee311719 (Merge branch 'ps/fewer-perl', > 2025-04-29) has been another clear sign that the Git project is actively > _removing_ scripting-language dependencies from the build and test > infrastructure, not adding new ones. > Now I wonder if the extension / addition of more Perl test infra with my other series: https://lore.kernel.org/git/pull.2135.git.1780559158.gitgitgadget@gmail.com/T/ also goes against the project direction w.r.t. removing scripting languages. Maybe I should also re-evaluate the usage of Perl there. I am leveraging existing shell parsing logic written in Perl, but maybe the preference for Perl-based lint rules is a mistake and should be avoided. > The clear prior art in Git's own tests for what t4080 wants to do, as of > today, is `t/helper/test-rot13-filter.c`, which could be imitated here > instead of (re-)introducing a dependency on a scripting language other > than Unix shell in Git's test suite. > > The `PYTHON` prereq exists in exactly five files today, all `git p4` > related (where Python is an intrinsic prerequisite given that `git-p4.py` > _is_ written in Python): `t/lib-git-p4.sh`, `t/t9802-git-p4-filetype.sh`, > `t/t9810-git-p4-rcs.sh`, `t/t9835-git-p4-metadata-encoding-python2.sh`, > and `t/t9836-git-p4-metadata-encoding-python3.sh`. > > After 7cdbff14d482 (remove merge-recursive-old, 2006-11-20), this here > patch would be the first one, after almost 20 years, to re-introduce > Python as a dependency outside `git p4`. > > And it would also be the first ever to embed a Python script as a heredoc: > > > +fi > > + > > +# > > +# A single parametric diff process. > > +# Usage: diff-process-backend --mode=<mode> [--log=<path>] > > +# > > +# Modes: > > +# whole-file - report all lines as changed (default) > > +# fixed-hunk - always report hunk 5 2 5 2 > > +# bad-hunk - report out-of-bounds hunk 999 1 999 1 > > +# bad-sync - report hunk with mismatched unchanged totals > > +# overlap - report two overlapping hunks > > +# no-hunks - return no hunks (files considered equivalent) > > +# error - return status=error for every request > > +# abort - return status=abort for every request > > +# crash - read one request then exit without responding > > +# > > +setup_backend () { > > + cat >"$TRASH_DIRECTORY/diff-process-backend.py" <<-\PYEOF > > + import sys, os > > + > > + def read_pkt(): > > + hdr = sys.stdin.buffer.read(4) > > + if len(hdr) < 4: return None > > + length = int(hdr, 16) > > + if length == 0: return "" > > + data = sys.stdin.buffer.read(length - 4) > > + return data.decode().rstrip("\n") > > + > > + def write_pkt(line): > > + data = (line + "\n").encode() > > + sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data) > > + sys.stdout.buffer.flush() > > + > > + def write_flush(): > > + sys.stdout.buffer.write(b"0000") > > + sys.stdout.buffer.flush() > > + > > + def read_content(): > > + chunks = [] > > + while True: > > + hdr = sys.stdin.buffer.read(4) > > + if len(hdr) < 4: break > > + length = int(hdr, 16) > > + if length == 0: break > > + chunks.append(sys.stdin.buffer.read(length - 4)) > > + return b"".join(chunks) > > + > > + mode = "whole-file" > > + logfile = None > > + for arg in sys.argv[1:]: > > + if arg.startswith("--mode="): > > + mode = arg[7:] > > + elif arg.startswith("--log="): > > + logfile = open(arg[6:], "a") > > + > > + def log(msg): > > + if logfile: > > + logfile.write(msg + "\n") > > + logfile.flush() > > + > > + # Handshake > > + assert read_pkt() == "git-diff-client" > > + assert read_pkt() == "version=1" > > + read_pkt() > > + write_pkt("git-diff-server") > > + write_pkt("version=1") > > + write_flush() > > + while True: > > + p = read_pkt() > > + if p == "": break > > + write_pkt("capability=hunks") > > + write_flush() > > + > > + log("ready") > > + > > + while True: > > + cmd = None > > + pathname = None > > + while True: > > + p = read_pkt() > > + if p is None: sys.exit(0) > > + if p == "": break > > + if p.startswith("command="): cmd = p.split("=",1)[1] > > + if p.startswith("pathname="): pathname = p.split("=",1)[1] > > + if cmd is None: sys.exit(0) > > + old = read_content() > > + new = read_content() > > + old_first = old.split(b"\n")[0].decode(errors="replace") if old else "" > > + new_first = new.split(b"\n")[0].decode(errors="replace") if new else "" > > + log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}") > > + > > + if mode == "error": > > + write_flush() > > + write_pkt("status=error") > > + write_flush() > > + continue > > + > > + if mode == "abort": > > + write_flush() > > + write_pkt("status=abort") > > + write_flush() > > + continue > > + > > + if mode == "crash": > > + sys.exit(1) > > + > > + if cmd == "hunks": > > + if mode == "fixed-hunk": > > + write_pkt("hunk 5 2 5 2") > > + elif mode == "bad-hunk": > > + write_pkt("hunk 999 1 999 1") > > + elif mode == "bad-sync": > > + write_pkt("hunk 1 2 1 1") > > + elif mode == "overlap": > > + write_pkt("hunk 1 5 1 5") > > + write_pkt("hunk 3 2 3 2") > > + elif mode == "no-hunks": > > + pass > > + else: > > + ol = old.count(b"\n") > > + nl = new.count(b"\n") > > + write_pkt(f"hunk 1 {ol} 1 {nl}") > > + write_flush() > > + write_pkt("status=success") > > + write_flush() > > + else: > > + write_flush() > > + write_pkt("status=error") > > + write_flush() > > + PYEOF > > The existing pattern is to provide larger scripts as fixtures in > associated `t/tNNNN/` directories, not as heredoc, see e.g. > `t/t1509/prepare-chroot.sh`. Writing scripts, especially lengthy ones, in > heredoc strings makes it virtually impossible to use static code analysis > or syntax highlighting to fend off banal errors. > > Given the complexity of what t4080 tries to test (error, abort, crash, > bad-sync, no-hunks, multiple files in one session, capability > negotiation), it would unfortunately be infeasible to use `test-tool > pkt-line` from a shell script implementing that `diff.*.process` protocol. > > So I've spiked a demo how the `test-tool diff-process-backend` could look > like (letting Opus do the menial typing, so that I can enjoy at least part > of a sunny Sunday outside), which also passes the CI build and test: > https://github.com/dscho/git/commit/b6e3c93381b00929476c3a00155f7cf7334a22e6 > > That commit is of course not intended to be used as-is; Feel free to pick > code parts of it and integrate them into your topic branch. Or write your > own test-tool helper from scratch if that's more your jam. > > Ciao, > Johannes > > > + write_script diff-process-backend <<-SHEOF > > + exec "$PYTHON_PATH" "$TRASH_DIRECTORY/diff-process-backend.py" "\$@" > > + SHEOF > > +} > > + > > +BACKEND="./diff-process-backend" > > + > > +test_expect_success PYTHON 'setup' ' > > + setup_backend && > > + echo "*.c diff=cdiff" >.gitattributes && > > + git add .gitattributes && > > + > > + # boundary.c: 10 lines, changes at 5-6 and 9-10. > > + # Used by: hunk boundaries, error fallback, crash, bad hunks, overlap. > > + cat >boundary.c <<-\EOF && > > + line1 > > + line2 > > + line3 > > + line4 > > + OLD5 > > + OLD6 > > + line7 > > + line8 > > + OLD9 > > + OLD10 > > + EOF > > + git add boundary.c && > > + > > + # worddiff.c: single-line function, value changes 1 -> 999. > > + # Used by: word-diff, --diff-algorithm, --no-ext-diff, --stat. > > + cat >worddiff.c <<-\EOF && > > + int value(void) { return 1; } > > + EOF > > + git add worddiff.c && > > + > > + # newfile.c: single-line function, value changes 42 -> 99. > > + # Used by: new file, --exit-code, multiple drivers. > > + cat >newfile.c <<-\EOF && > > + int new_func(void) { return 42; } > > + EOF > > + git add newfile.c && > > + > > + # logtest.c: single-line function for log/format-patch tests. > > + # Needs two commits so log -1 has a diff. > > + cat >logtest.c <<-\EOF && > > + int logfunc(void) { return 1; } > > + EOF > > + git add logtest.c && > > + > > + # two.c/one.c: two-file pair for error/abort/startup-failure tests. > > + cat >one.c <<-\EOF && > > + int first(void) { return 1; } > > + EOF > > + cat >two.c <<-\EOF && > > + int second(void) { return 2; } > > + EOF > > + git add one.c two.c && > > + > > + git commit -m "initial" && > > + > > + # Second commit for logtest.c (so log -1 has something to show). > > + cat >logtest.c <<-\EOF && > > + int logfunc(void) { return 2; } > > + EOF > > + git add logtest.c && > > + git commit -m "change logtest.c" && > > + > > + # Working tree modifications (not committed). > > + cat >boundary.c <<-\EOF && > > + line1 > > + line2 > > + line3 > > + line4 > > + NEW5 > > + NEW6 > > + line7 > > + line8 > > + NEW9 > > + NEW10 > > + EOF > > + > > + cat >worddiff.c <<-\EOF && > > + int value(void) { return 999; } > > + EOF > > + > > + cat >newfile.c <<-\EOF && > > + int new_func(void) { return 99; } > > + EOF > > + > > + cat >one.c <<-\EOF && > > + int first(void) { return 10; } > > + EOF > > + > > + cat >two.c <<-\EOF > > + int second(void) { return 20; } > > + EOF > > +' > > + > > +# > > +# Core behavior: the tool controls which lines are marked as changed. > > +# > > + > > +test_expect_success PYTHON 'diff process hunk boundaries affect output' ' > > + # The file has changes at lines 5-6 and 9-10, but fixed-hunk > > + # only reports lines 5-6 as changed. Lines 9-10 should not > > + # appear as changed in the output. > > + git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \ > > + diff boundary.c >actual && > > + test_grep "^-OLD5" actual && > > + test_grep "^-OLD6" actual && > > + test_grep "^+NEW5" actual && > > + test_grep "^+NEW6" actual && > > + test_grep ! "^-OLD9" actual && > > + test_grep ! "^-OLD10" actual && > > + test_grep ! "^+NEW9" actual && > > + test_grep ! "^+NEW10" actual > > +' > > + > > +test_expect_success PYTHON 'diff process works with new file' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff -- newfile.c >actual 2>stderr && > > + test_grep "return 99" actual && > > + test_grep "pathname=newfile.c" backend.log && > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process works with added file (empty old side)' ' > > + cat >added.c <<-\EOF && > > + int added(void) { return 1; } > > + EOF > > + git add added.c && > > + > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff --cached -- added.c >actual 2>stderr && > > + test_grep "added" actual && > > + test_grep "pathname=added.c" backend.log && > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process skipped for binary files' ' > > + printf "\\0binary" >binary.c && > > + git add binary.c && > > + git commit -m "add binary" && > > + printf "\\0changed" >binary.c && > > + > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff -- binary.c >actual && > > + test_grep "Binary files" actual && > > + test_path_is_missing backend.log > > +' > > + > > +test_expect_success PYTHON 'diff process not consulted for unmatched driver' ' > > + echo "not tracked by cdiff" >unmatched.txt && > > + git add unmatched.txt && > > + git commit -m "add unmatched.txt" && > > + > > + echo "modified" >unmatched.txt && > > + > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff -- unmatched.txt >actual && > > + test_grep "modified" actual && > > + test_path_is_missing backend.log > > +' > > + > > +test_expect_success PYTHON 'multiple drivers use separate processes' ' > > + echo "*.h diff=hdiff" >>.gitattributes && > > + git add .gitattributes && > > + > > + cat >multi.h <<-\EOF && > > + int header(void) { return 1; } > > + EOF > > + git add multi.h && > > + git commit -m "add multi.h" && > > + > > + cat >multi.h <<-\EOF && > > + int header(void) { return 2; } > > + EOF > > + > > + rm -f backend-c.log backend-h.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend-c.log" \ > > + -c diff.hdiff.process="$BACKEND --log=backend-h.log" \ > > + diff -- newfile.c multi.h >actual 2>stderr && > > + test_grep "pathname=newfile.c" backend-c.log && > > + test_grep "pathname=multi.h" backend-h.log && > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process works alongside textconv' ' > > + write_script uppercase-filter <<-\EOF && > > + tr "a-z" "A-Z" <"$1" > > + EOF > > + > > + cat >textconv.c <<-\EOF && > > + hello world > > + EOF > > + git add textconv.c && > > + git commit -m "add textconv.c" && > > + > > + cat >textconv.c <<-\EOF && > > + goodbye world > > + EOF > > + > > + rm -f backend.log && > > + git -c diff.cdiff.textconv="./uppercase-filter" \ > > + -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff -- textconv.c >actual 2>stderr && > > + # The diff process receives textconv-transformed (uppercase) content. > > + test_grep "pathname=textconv.c" backend.log && > > + test_grep "old=HELLO WORLD" backend.log && > > + test_grep "new=GOODBYE WORLD" backend.log && > > + test_must_be_empty stderr > > +' > > + > > +# > > +# Downstream features: word diff, log, equivalent files, exit code. > > +# > > + > > +test_expect_success PYTHON 'diff process with --word-diff' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff --word-diff worddiff.c >actual 2>stderr && > > + test_grep "\[-1;-\]" actual && > > + test_grep "{+999;+}" actual && > > + test_grep "pathname=worddiff.c" backend.log && > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process works with git log -p' ' > > + # With no-hunks mode, the tool says the files are equivalent, > > + # so log -p should show the commit but no diff content. > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --mode=no-hunks --log=backend.log" \ > > + log -1 -p -- logtest.c >actual 2>stderr && > > + test_grep "change logtest.c" actual && > > + test_grep ! "return 2" actual && > > + test_grep "command=hunks pathname=logtest.c" backend.log && > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process no hunks suppresses diff output' ' > > + cat >nohunks.c <<-\EOF && > > + int zero(void) { return 0; } > > + EOF > > + git add nohunks.c && > > + git commit -m "add nohunks.c" && > > + > > + cat >nohunks.c <<-\EOF && > > + int zero(void) { return 999; } > > + EOF > > + > > + git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ > > + diff nohunks.c >actual && > > + test_must_be_empty actual > > +' > > + > > +test_expect_success PYTHON 'diff process no hunks with --exit-code returns success' ' > > + git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \ > > + diff --exit-code nohunks.c > > +' > > + > > +test_expect_success PYTHON 'diff process with --exit-code and hunks returns failure' ' > > + test_expect_code 1 git -c diff.cdiff.process="$BACKEND" \ > > + diff --exit-code newfile.c > > +' > > + > > +# > > +# Bypass mechanisms: flags and commands that skip the diff process. > > +# > > + > > +test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff --diff-algorithm=patience worddiff.c >actual && > > + test_grep "return 999" actual && > > + test_path_is_missing backend.log > > +' > > + > > +test_expect_success PYTHON 'diff process not used by --stat' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --log=backend.log" \ > > + diff --stat worddiff.c >actual && > > + test_grep "worddiff.c" actual && > > + test_path_is_missing backend.log > > +' > > + > > +# > > +# Error handling and fallback. > > +# > > + > > +test_expect_success PYTHON 'diff process fallback on tool error status' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ > > + diff boundary.c >actual 2>stderr && > > + # Fallback produces the full builtin diff (both change regions). > > + test_grep "^-OLD5" actual && > > + test_grep "^+NEW5" actual && > > + test_grep "^-OLD9" actual && > > + test_grep "^+NEW9" actual && > > + # Tool was contacted (it replied with error, not crash). > > + test_grep "command=hunks pathname=boundary.c" backend.log && > > + test_grep "diff process.*failed" stderr > > +' > > + > > +test_expect_success PYTHON 'diff process error keeps tool available for next file' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \ > > + diff -- one.c two.c >actual 2>stderr && > > + # Unlike abort, error keeps the tool available: both files > > + # are sent to the tool (and both fall back). > > + test_grep "pathname=one.c" backend.log && > > + test_grep "pathname=two.c" backend.log && > > + test_grep "return 10" actual && > > + test_grep "return 20" actual > > +' > > + > > +test_expect_success PYTHON 'diff process abort disables for session' ' > > + rm -f backend.log && > > + git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \ > > + diff -- one.c two.c >actual && > > + # Both files should still produce diff output via fallback. > > + test_grep "return 10" actual && > > + test_grep "return 20" actual && > > + # The tool aborts on the first file and git clears its > > + # capability. The second file never contacts the tool. > > + test_grep "pathname=one.c" backend.log && > > + test_grep ! "pathname=two.c" backend.log > > +' > > + > > +test_expect_success PYTHON 'diff process fallback on tool crash' ' > > + git -c diff.cdiff.process="$BACKEND --mode=crash" \ > > + diff boundary.c >actual 2>stderr && > > + test_grep "^-OLD5" actual && > > + test_grep "^+NEW5" actual && > > + test_grep "^-OLD9" actual && > > + test_grep "^+NEW9" actual && > > + # Crash is a communication failure, so a warning is emitted. > > + test_grep "diff process.*failed" stderr > > +' > > + > > +test_expect_success PYTHON 'diff process startup failure only warns once' ' > > + git -c diff.cdiff.process="/nonexistent/tool" \ > > + diff -- one.c two.c >actual 2>stderr && > > + # Both files produce diff output via fallback. > > + test_grep "return 10" actual && > > + test_grep "return 20" actual && > > + # Sentinel prevents repeated warnings: only one, not one per file. > > + test_grep "diff process.*failed" stderr >warnings && > > + test_line_count = 1 warnings > > +' > > + > > +test_expect_success PYTHON 'diff process fallback on bad hunks' ' > > + git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \ > > + diff boundary.c >actual 2>stderr && > > + test_grep "^-OLD5" actual && > > + test_grep "^+NEW5" actual && > > + test_grep "^-OLD9" actual && > > + test_grep "^+NEW9" actual && > > + # Invalid hunks are caught by xdiff validation, not the > > + # protocol layer, so no warning is emitted. > > + test_must_be_empty stderr > > +' > > + > > +test_expect_success PYTHON 'diff process fallback on mismatched unchanged totals' ' > > + cat >synctest.c <<-\EOF && > > + line1 > > + line2 > > + line3 > > + EOF > > + git add synctest.c && > > + git commit -m "add synctest.c" && > > + > > + cat >synctest.c <<-\EOF && > > + line1 > > + changed > > + line3 > > + EOF > > + > > + # bad-sync reports hunk 1 2 1 1: marks 2 old lines and 1 new > > + # line as changed, leaving 1 unchanged old vs 2 unchanged new. > > + # The synchronization invariant fails and git falls back. > > + git -c diff.cdiff.process="$BACKEND --mode=bad-sync" \ > > + diff synctest.c >actual 2>stderr && > > + test_grep "changed" actual > > +' > > + > > +test_expect_success PYTHON 'diff process fallback on overlapping hunks' ' > > + # boundary.c has 10 lines, so both hunks are in bounds > > + # but they overlap at lines 3-5, triggering the ordering check. > > + git -c diff.cdiff.process="$BACKEND --mode=overlap" \ > > + diff boundary.c >actual 2>stderr && > > + test_grep "NEW5" actual > > +' > > + > > +test_done > > diff --git a/userdiff.h b/userdiff.h > > index 51c26e0d41..a98eabe377 100644 > > --- a/userdiff.h > > +++ b/userdiff.h > > @@ -3,6 +3,7 @@ > > > > #include "notes-cache.h" > > > > +struct diff_subprocess; > > struct index_state; > > struct repository; > > > > @@ -33,6 +34,8 @@ struct userdiff_driver { > > int textconv_want_cache; > > const char *process; > > char *process_owned; > > + struct diff_subprocess *diff_subprocess; > > + unsigned diff_process_failed : 1; > > }; > > enum userdiff_driver_type { > > USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0, > > -- > > gitgitgadget > > > > > >

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > So the conscious project direction has been: fold pkt-line test backends > into `test-tool` and drop the scripting-language prereq. Reintroducing the > same shape in Python would walk this back. > ... > The `PYTHON` prereq exists in exactly five files today, all `git p4` > related (where Python is an intrinsic prerequisite given that `git-p4.py` > _is_ written in Python): `t/lib-git-p4.sh`, `t/t9802-git-p4-filetype.sh`, > `t/t9810-git-p4-rcs.sh`, `t/t9835-git-p4-metadata-encoding-python2.sh`, > and `t/t9836-git-p4-metadata-encoding-python3.sh`. > ... > That commit is of course not intended to be used as-is; Feel free to pick > code parts of it and integrate them into your topic branch. Or write your > own test-tool helper from scratch if that's more your jam. Showing better direction to new folks with such a clear thinking is very much appreciated. Even though it is natural and perfectly OK for tests that interacts with parts of Git that are written in these languages (e.g., we are OK for gitweb tests to require Perl), we should consciously keep ourselves clean and not adding unnecessary dependencies.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

Michael Montalbo <mmontalbo@gmail.com> writes: > On Sun, Jun 7, 2026 at 7:36 AM Johannes Schindelin > <Johannes.Schindelin@gmx.de> wrote: >> >> Hi Michael, >> >> I stumbled about this patch when it broke CI in Git for Windows, where we >> do _not_ use `NO_PYTHON`, even though Python is unavailable in the >> build/test CI jobs. The existing tests handle this situation gracefully, >> this here patch does not: >> ... >> Given the complexity of what t4080 tries to test (error, abort, crash, >> bad-sync, no-hunks, multiple files in one session, capability >> negotiation), it would unfortunately be infeasible to use `test-tool >> pkt-line` from a shell script implementing that `diff.*.process` protocol. >> >> So I've spiked a demo how the `test-tool diff-process-backend` could look >> like (letting Opus do the menial typing, so that I can enjoy at least part >> of a sunny Sunday outside), which also passes the CI build and test: >> https://github.com/dscho/git/commit/b6e3c93381b00929476c3a00155f7cf7334a22e6 >> >> That commit is of course not intended to be used as-is; Feel free to pick >> code parts of it and integrate them into your topic branch. Or write your >> own test-tool helper from scratch if that's more your jam. >> > > Johannes, thank you for the great feedback. The historical context is > really helpful and > the concerns you raise make a lot of sense. I will take a look at your > spike and also work > on removing Python from the test. Another request. Please do not force readers to scroll through a ~800 line message just to read only 5 lines of response from you. Keep relevant parts of the message you are responding to in your message to help readers understand the context in which your response was made, but trim everything else that is not relevant from your quote. Thanks.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

Michael Montalbo <mmontalbo@gmail.com> writes: >> So the conscious project direction has been: fold pkt-line test backends >> into `test-tool` and drop the scripting-language prereq. Reintroducing the >> same shape in Python would walk this back. >> >> Patrick's careful effort in 27bd8ee311719 (Merge branch 'ps/fewer-perl', >> 2025-04-29) has been another clear sign that the Git project is actively >> _removing_ scripting-language dependencies from the build and test >> infrastructure, not adding new ones. > > Now I wonder if the extension / addition of more Perl test infra with my other > series: > > https://lore.kernel.org/git/pull.2135.git.1780559158.gitgitgadget@gmail.com/T/ > > also goes against the project direction w.r.t. removing scripting languages. > Maybe I should also re-evaluate the usage of Perl there. I am leveraging > existing shell parsing logic written in Perl, but maybe the preference for > Perl-based lint rules is a mistake and should be avoided. That sounds prudent (even though it is outside the scope of _this_ topic, of course). Thanks.

-Original file line number
+Diff line change
@@ Expand Up / @@ -218,6 +218,11 @@ endif::git-diff[] @@
     	Set this option to `true` to make the diff driver cache the text
     	conversion outputs.  See linkgit:gitattributes[5] for details.
+    `diff.<driver>.process`::
+    	The command to run as a long-running diff process that
+    	provides hunks to Git's diff pipeline.
+    	See linkgit:gitattributes[5] for details.
     `diff.indentHeuristic`::
     	Set this option to `false` to disable the default heuristics
     	that shift diff hunk boundaries to make patches easier to read.
@@ Expand Down @@

-Original file line number
+Diff line change
@@ Expand Up / @@ -18,3 +18,6 @@ @@
     For instance, if you configured the `diff.algorithm` variable to a
     non-default value and want to use the default one, then you
     have to use `--diff-algorithm=default` option.
+    +
+    If you explicitly choose a diff algorithm, it also bypasses
+    `diff.<driver>.process` (see linkgit:gitattributes[5]).

-Original file line number
+Diff line change
@@ Expand Up / @@ -833,7 +833,9 @@ endif::git-format-patch[] @@
     	to use this option with linkgit:git-log[1] and friends.
     `--no-ext-diff`::
-    	Disallow external diff drivers.
+    	Disallow external diff helpers, including
+    	`diff.<driver>.command` and `diff.<driver>.process`
+    	(see linkgit:gitattributes[5]).
     `--textconv`::
     `--no-textconv`::
@@ Expand Down @@

-Original file line number
+Diff line change
@@ Expand Up / @@ -811,6 +811,7 @@ TEST_BUILTINS_OBJS += test-csprng.o @@
     TEST_BUILTINS_OBJS += test-date.o
     TEST_BUILTINS_OBJS += test-delete-gpgsig.o
     TEST_BUILTINS_OBJS += test-delta.o
+    TEST_BUILTINS_OBJS += test-diff-process-backend.o
     TEST_BUILTINS_OBJS += test-dir-iterator.o
     TEST_BUILTINS_OBJS += test-drop-caches.o
     TEST_BUILTINS_OBJS += test-dump-cache-tree.o
@@ Expand Down Expand Up / @@ -1140,6 +1141,7 @@ LIB_OBJS += diff-delta.o @@
     LIB_OBJS += diff-merges.o
     LIB_OBJS += diff-lib.o
     LIB_OBJS += diff-no-index.o
+    LIB_OBJS += diff-process.o
     LIB_OBJS += diff.o
     LIB_OBJS += diffcore-break.o
     LIB_OBJS += diffcore-delta.o
@@ Expand Down @@

-Original file line number
+Diff line change
@@ Expand Up / @@ -19,6 +19,8 @@ @@
     #include "tag.h"
     #include "trace2.h"
     #include "blame.h"
+    #include "diff-process.h"
+    #include "xdiff-interface.h"
     #include "alloc.h"
     #include "commit-slab.h"
     #include "bloom.h"
@@ Expand Down Expand Up @@
-    static int diff_hunks(mmfile_t *file_a, mmfile_t *file_b,
-    		      xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, int xdl_opts)
+    static int diff_hunks_xpp(mmfile_t *file_a, mmfile_t *file_b,
+    			  xdl_emit_hunk_consume_func_t hunk_func,
+    			  void *cb_data, xpparam_t *xpp)
     {
-    	xpparam_t xpp = {0};
     	xdemitconf_t xecfg = {0};
     	xdemitcb_t ecb = {NULL};
-    	xpp.flags = xdl_opts;
     	xecfg.hunk_func = hunk_func;
     	ecb.priv = cb_data;
-    	return xdi_diff(file_a, file_b, &xpp, &xecfg, &ecb);
+    	return xdi_diff(file_a, file_b, xpp, &xecfg, &ecb);
+    }
+    static int diff_hunks(mmfile_t *file_a, mmfile_t *file_b,
+    		      xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, int xdl_opts)
+    {
+    	xpparam_t xpp = {0};
+    	xpp.flags = xdl_opts;
+    	return diff_hunks_xpp(file_a, file_b, hunk_func, cb_data, &xpp);
     }
     static const char *get_next_line(const char *start, const char *end)
@@ Expand Down Expand Up @@
     				 struct blame_origin *parent, int ignore_diffs)
     {
     	mmfile_t file_p, file_o;
+    	xpparam_t xpp = {0};
     	struct blame_chunk_cb_data d;
     	struct blame_entry *newdest = NULL;
@@ Expand All @@
     			 &sb->num_read_blob, ignore_diffs);
     	sb->num_get_patch++;
-    	if (diff_hunks(&file_p, &file_o, blame_chunk_cb, &d, sb->xdl_opts))
-    		die("unable to generate diff (%s -> %s)",
-    		    oid_to_hex(&parent->commit->object.oid),
-    		    oid_to_hex(&target->commit->object.oid));
+    	xpp.flags = sb->xdl_opts;
+    	/*
+    	 * Whitespace-ignoring options are not communicated to the diff
+    	 * process, so it could not honor them; consult it only when none
+    	 * are set, and otherwise use the builtin diff (which does honor
+    	 * them via xpp.flags), matching the bypass in
+    	 * diff_process_fill_hunks() for "git diff".  If the process
+    	 * considers the files equivalent, skip the diff so blame looks
+    	 * past this commit.
+    	 */
+    	if ((xpp.flags & (XDF_WHITESPACE_FLAGS | XDF_IGNORE_BLANK_LINES)) ||
+    	    diff_process_fill_hunks(&sb->revs->diffopt, target->path,
+    				    &file_p, &file_o, &xpp)
+    	    != DIFF_PROCESS_EQUIVALENT) {
+    		if (diff_hunks_xpp(&file_p, &file_o, blame_chunk_cb,
+    				   &d, &xpp))
+    			die("unable to generate diff (%s -> %s)",
+    			    oid_to_hex(&parent->commit->object.oid),
+    			    oid_to_hex(&target->commit->object.oid));
+    	}
+    	free(xpp.external_hunks);
     	/* The rest are the same as the parent */
     	blame_chunk(&d.dstq, &d.srcq, INT_MAX, d.offset, INT_MAX, 0,
     		    parent, target, 0);
@@ Expand Down @@

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RFC] diff: add `diff.<driver>.process` for external hunk providers #2120

Diff view

Diff view

There are no files selected for viewing

gitgitgadget Bot May 26, 2026

Uh oh!

gitgitgadget Bot May 29, 2026

Uh oh!

gitgitgadget Bot May 26, 2026

Uh oh!

gitgitgadget Bot May 29, 2026

Uh oh!

gitgitgadget Bot Jun 7, 2026

Uh oh!

gitgitgadget Bot Jun 7, 2026

Uh oh!

gitgitgadget Bot Jun 7, 2026

Uh oh!

gitgitgadget Bot Jun 8, 2026

Uh oh!

gitgitgadget Bot Jun 8, 2026

Uh oh!

gitgitgadget Bot Jun 8, 2026

Uh oh!

Uh oh!

Uh oh!

-Original file line number
+Diff line change
@@ Expand Up / @@ -784,6 +784,16 @@ with the above configuration, i.e. `j-c-diff`, with 7 @@
     parameters, just like `GIT_EXTERNAL_DIFF` program is called.
     See linkgit:git[1] for details.
+    An external diff driver replaces the patch Git would otherwise
+    produce for the path: Git runs the command and shows its output in
+    place of its own.  Output features that post-process Git's diff do
+    not apply to it; word diff, function context (`-W`), `--color-moved`,
+    and coloring all act on Git's builtin diff, not the driver's output.
+    The driver is consulted only when Git generates a textual patch.  The
+    summary formats (`--stat`, `--numstat`, `--shortstat`, and
+    `--dirstat`), `git blame`, and `git log -L` do not run it and
+    continue to use Git's builtin diff.
     If the program is able to ignore certain changes (similar to
     `git diff --ignore-space-change`), then also set the option
     `trustExitCode` to true.  It is then expected to return exit code 1 if
@@ Expand Down Expand Up @@
     (see above), and adding `diff.<name>.algorithm` has no effect, as the
     algorithm is not passed to the external diff driver.
+    Using an external diff process
+    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+    If `diff.<name>.process` is defined, Git sends the old and new file
+    content to an external tool and receives back a list of changed
+    regions (pairs of line ranges in the old and new file).  Git uses
+    these instead of its builtin diff algorithm, but still controls
+    all output formatting, so features like word diff, function context,
+    color, and blame work normally.  This is achieved by using the
+    long-running process protocol (described in
+    Documentation/technical/long-running-process-protocol.adoc).
+    Unlike `diff.<name>.command`, which replaces Git's output entirely,
+    the diff process feeds results back into the standard pipeline.  If
+    both are configured for a path, `diff.<name>.command` takes precedence
+    for the output it produces: since it replaces the diff, the process is
+    not consulted there.
+    First, in `.gitattributes`, assign the `diff` attribute for paths.
+    ------------------------
+    *.c diff=cdiff
+    ------------------------
+    Then, define a "diff.<name>.process" configuration to specify
+    the diff process command.
+    ----------------------------------------------------------------
+    [diff "cdiff"]
+      process = /path/to/diff-process-tool
+    ----------------------------------------------------------------
+    When Git encounters the first file that needs to be diffed, it starts
+    the process and performs the handshake.  In the handshake, the welcome
+    message sent by Git is "git-diff-client", only version 1 is supported,
+    and the supported capability is "hunks" (the changed regions
+    described below).
+    For each file, Git sends a list of "key=value" pairs terminated with
+    a flush packet, followed by the old and new file content as packetized
+    data, each terminated with a flush packet.  The pathname is relative
+    to the repository root.  When `diff.<name>.textconv` is also set,
+    the tool receives the textconv-transformed content rather than the
+    raw blob, matching what the consuming feature itself diffs: patch
+    output is textconv'd, the summary formats (noted below) are not, and
+    `git blame` applies textconv only under `--textconv`.  Git does not
+    send binary files to the diff process.
+    -----------------------
+    packet:          git> command=hunks
+    packet:          git> pathname=path/file.c
+    packet:          git> 0000
+    packet:          git> OLD_CONTENT
+    packet:          git> 0000
+    packet:          git> NEW_CONTENT
+    packet:          git> 0000
+    -----------------------
+    The tool is expected to respond with zero or more hunk lines,
+    a flush packet, and a status packet terminated with a flush packet.
+    Each hunk line has the form:
+      `hunk <old_start> <old_count> <new_start> <new_count>`
+    where `<old_start>` and `<old_count>` identify a range of lines in
+    the old file, and `<new_start>` and `<new_count>` identify the
+    replacement range in the new file.  Start values are 1-based and
+    counts are non-negative.  Ranges must not extend beyond the end of
+    the file.  For example, `hunk 3 2 3 4` means that 2 lines starting
+    at line 3 in the old file were replaced by 4 lines starting at
+    line 3 in the new file.  An `<old_count>` of 0 means no lines were
+    removed (pure insertion); a `<new_count>` of 0 means no lines were
+    added (pure deletion).  A start value of 0 is accepted when
+    the corresponding count is 0 (e.g., `hunk 0 0 1 5` for a newly
+    added file), matching what `git diff` itself emits for empty
+    file sides.  Git ignores any extra whitespace-separated tokens after
+    `<new_count>`, so a future protocol version can append fields to a
+    hunk line (for example a "moved" marker) without older clients
+    rejecting it.
+    Lines are delimited by newlines.  A file `"foo\nbar\n"` and a
+    file `"foo\nbar"` both have 2 lines.
+    Hunks must be listed in order and must not overlap.  Any line not
+    covered by a hunk is treated as unchanged and is paired, in order,
+    with the unchanged lines on the other side.  Each run of unchanged
+    lines between two hunks (and the run before the first hunk and
+    after the last) must therefore be the same length on both sides,
+    not merely equal in total.  For the hunks `1 3 1 5` and `10 2 12 2`
+    below, lines 4-9 of the old file and lines 6-11 of the new file are
+    both the six unchanged lines between the two hunks.  A response that
+    balances only the total unchanged count but misaligns one of these
+    runs is rejected, and Git falls back to the builtin diff.
+    Git does not check that the lines a hunk leaves unchanged are
+    byte-for-byte identical between the two sides; it pairs them by
+    position and shows the new side as context.  A tool may therefore
+    report lines that differ textually (a pure reformatting, say) as
+    unchanged, and the diff reflects that judgment.  This is
+    the point of a semantic backend, but it means a misbehaving tool can
+    produce a diff whose context does not match the old blob; as with
+    `git diff -w`, such a patch may not apply against the old content.
+    -----------------------
+    packet:          git< hunk 1 3 1 5
+    packet:          git< hunk 10 2 12 2
+    packet:          git< 0000
+    packet:          git< status=success
+    packet:          git< 0000
+    -----------------------
+    If the tool responds with hunks and "success", Git marks those lines
+    as changed and feeds them into the standard diff pipeline.  Git may
+    still slide or regroup those changes against matching context for
+    display, exactly as it compacts its own diffs, so the tool controls
+    which lines are reported as changed, not the precise hunk boundaries.
+    Patch output features (word diff, function context, color) work
+    normally, as do summary formats like `--stat`.  Not every feature
+    consults the process, though; see "Which features consult the diff
+    process" below for the full picture and the reasoning behind it.
+    If no hunk lines precede the flush, followed by "success", Git
+    treats the files as having no changes: `git diff` produces no output,
+    `git diff --exit-code` and `--quiet` report success even though the
+    stored blobs differ, and `git blame` skips the commit, attributing
+    lines to earlier commits.
+    The one exception is a change that only adds or removes the file's
+    trailing newline: it cannot be expressed as line hunks, so when the
+    line content otherwise matches Git keeps the builtin diff for that
+    file (preserving the `\ No newline at end of file` marker) instead of
+    treating the two sides as equal.
+    -----------------------
+    packet:          git< 0000
+    packet:          git< status=success
+    packet:          git< 0000
+    -----------------------
+    If the tool returns invalid hunks (out of bounds, overlapping, or
+    mismatched unchanged line counts), Git warns and falls back to the
+    builtin diff algorithm.
+    In case the tool cannot or does not want to process the content,
+    it is expected to respond with an "error" status.  Git warns and
+    falls back to the builtin diff algorithm for this file.  The tool
+    remains available for subsequent files.
+    -----------------------
+    packet:          git< 0000
+    packet:          git< status=error
+    packet:          git< 0000
+    -----------------------
+    In case the tool cannot or does not want to process the content as
+    well as any future content for the lifetime of the Git process, it
+    is expected to respond with an "abort" status.  Git silently falls
+    back to the builtin diff algorithm for this file and does not send
+    further requests to the tool.
+    -----------------------
+    packet:          git< 0000
+    packet:          git< status=abort
+    packet:          git< 0000
+    -----------------------
+    If the tool dies during the communication or does not adhere to the
+    protocol then Git will stop the process and fall back to the builtin
+    diff algorithm.  Git warns once and does not restart the process for
+    subsequent files.
+    Tools should ignore unknown keys in the per-file request to remain
+    forward-compatible.  Future versions of Git may send additional
+    `command=` values; tools that receive an unrecognized command should
+    respond with `status=error` rather than terminating.
+    Which features consult the diff process
+    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+    The diff process answers a single question: given two blobs, which
+    line ranges differ?  Whether a particular feature consults it follows
+    from whether that is the question the feature is really asking.
+    Features that ask "which lines changed" use the tool's hunks in place
+    of the builtin algorithm:
+    - `git diff` patch output, together with everything layered on it:
+      word diff, function context (`-W`), `--color-moved`, and the `@@`
+      hunk headers.  These operate on the lines the patch step already
+      emitted, so they reflect the tool's hunks without any further
+      negotiation.
+    - `git blame`: a commit whose change the tool reports as equivalent is
+      skipped, and its lines are attributed to an earlier commit.
+    - `git log -L`: both the line-range display and the underlying range
+      tracking consult the tool, so a commit it reports as equivalent is
+      dropped from the log (its tracked range maps across unchanged)
+      rather than selected and then shown with an empty diff.
+    - `--stat`, `--numstat`, and `--shortstat`: the inserted and deleted
+      counts come from the tool's hunks, so a file the tool calls
+      equivalent contributes no stat line, matching the empty patch that
+      `git diff` produces for it.  These summary formats do not apply
+      textconv (just as the builtin summary path does not), so the tool
+      is consulted on the raw blob content even when a `textconv` is also
+      configured for patch output; this mirrors how builtin `--stat`
+      already counts raw lines rather than the textconv'd view.  The
+      line-counting `--dirstat=lines` uses these same counts; the default
+      `--dirstat`, which weighs byte changes, is computed on its own and
+      does not consult the tool.
+    Features that ask a different question do not consult the process, by
+    design:
+    - The pickaxe `-G<regex>` searches the textual diff for a pattern; it
+      asks "does this string appear in the diff," not "did these lines
+      change."  (`-S` runs at an earlier stage and is likewise unaffected.)
+    - `git patch-id` must produce a stable hash for `git rebase` and
+      cherry-pick detection; deriving it from a configured tool would make
+      equal patches hash differently from machine to machine.
+    - The merge machinery (`git merge-tree`, `rerere`) computes merge
+      content and conflict signatures rather than display output, so the
+      tool's hunks must not alter its results.
+    - `git range-diff` diffs patch text, not source blobs, so source-file
+      hunks do not apply to it.
+    - `--check` reports whitespace errors in added lines using the builtin
+      diff's notion of which lines are added, not the tool's.  It can
+      therefore flag (and exit non-zero on) a line the tool treats as
+      unchanged and that `git diff` shows as context.  Whitespace breakage
+      is a property of the literal bytes, so `--check` keeps the builtin
+      partition deliberately; a future change could wire it to the tool if
+      matching `git diff` exactly became desirable.
+    - `--raw`, `--name-only`, and `--name-status` compare object ids at
+      the tree level and never run a line-level diff at all.
+    Combined diffs (`--cc` and merge diffs) ask "which lines changed" but
+    still use the builtin algorithm, and may consult the process in a
+    later change; their protocol would have to be extended from a single
+    old/new pair to one comparison per merge parent.
+    `--no-ext-diff` and `--diff-algorithm` bypass the process entirely,
+    for every feature listed above.  The whitespace-ignoring options
+    (`-w`, `--ignore-space-change`, `--ignore-blank-lines`, and the like),
+    `-I<regex>`, and `--anchored` also bypass it for the affected files:
+    the tool is never told about these options, so it could not honor
+    them, and Git falls back to the builtin diff, which does.
     Defining a custom hunk-header
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ Expand Down @@

Uh oh!

[RFC] diff: add diff.<driver>.process for external hunk providers #2120

Are you sure you want to change the base?

[RFC] diff: add diff.<driver>.process for external hunk providers #2120

Uh oh!

Uh oh!

Diff view

Diff view

There are no files selected for viewing

gitgitgadget Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gitgitgadget Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

gitgitgadget Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gitgitgadget Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

gitgitgadget Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

gitgitgadget Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

gitgitgadget Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

gitgitgadget Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gitgitgadget Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

gitgitgadget Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

[RFC] diff: add `diff.<driver>.process` for external hunk providers #2120

[RFC] diff: add `diff.<driver>.process` for external hunk providers #2120