Detect file paths preceded by Unicode/CJK punctuation#10250
Conversation
|
I'm starting a first review of this pull request. You can view the conversation on Warp. I reviewed this pull request and requested human review from: Comment Powered by Oz |
There was a problem hiding this comment.
Overview
This PR updates file-path boundary detection to recognize non-ASCII whitespace and selected Unicode punctuation, fixes candidate substring indexing for multi-byte separators, and uses Unicode display width for separator fragments.
Concerns
- No blocking correctness, security, or performance concerns found in the annotated diff.
Verdict
Found: 0 critical, 0 important, 0 suggestions
Approve
Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).
Powered by Oz
|
hey @SagarSDagdu - #10004 (merged earlier today) also expanded the set of file link separators and added general support for non-ASCII ones. mind rebasing on top of those changes? i think it'll mostly simplify your PR down to |
570ca63 to
81f72dd
Compare
|
Rebased on master to pick up #10004's multi-byte indexing refactor that already covered the byte-range fix I'd duplicated, so this PR now reduces to:
Please review @vorporeal. |
vorporeal
left a comment
There was a problem hiding this comment.
looks great; thank you!
|
@vorporeal I think you'll have to kickoff the workflows manually on this PR to merge it, as I had to resolve some conflicts |
Clickable file path detection didn't recognize paths preceded by CJK / full-width punctuation (`:`, `(`, `「`, etc.) — common in CJK prose and AI CLI output like `路径:/path/to/file`. Replaces the four `FILE_LINK_SEPARATORS.contains(&c)` sites in `grid_handler` and the one in `link_detection` with a new `is_file_link_separator` helper. The helper accepts the existing ASCII-and-box-drawing set, plus any non-ASCII Unicode whitespace and general categories `Po | Ps | Pe | Pi | Pf`. Connectors (`Pc`), dashes (`Pd`), and CJK letters (`Lo`) are deliberately excluded so paths containing `_`, `-`, or CJK characters in their names still detect as a single token. Also fixes a latent issue in `line_to_fragments`: the separator fragment hardcoded `total_cell_width = 1`, but full-width separators visually occupy two cells; uses `UnicodeWidthChar::width` instead. Fixes warpdotdev#10245.
Head branch was pushed to by a user without write access
41ef721 to
d50a742
Compare
## Description Clickable file path detection in the terminal grid relied on a hardcoded ASCII separator set, so a path immediately preceded by CJK / full-width punctuation was not detected as clickable. This affects CJK prose and AI CLI output like: ``` 路径:/Users/me/project/plan.md ← was NOT clickable (fullwidth :touching /) 路径: /Users/me/project/plan.md ← clickable (ASCII : + space) /Users/me/project/plan.md ← clickable (bare) ``` This change replaces every `FILE_LINK_SEPARATORS.contains(&c)` site with a new `is_file_link_separator(c)` helper that also accepts non-ASCII Unicode whitespace and general categories `Po | Ps | Pe | Pi | Pf`. Connectors (`Pc`), dashes (`Pd`), and CJK letters (`Lo`) are deliberately excluded so paths containing `_`, `-`, or CJK characters in their names (e.g. `/path/音楽/テスト.txt`) continue to detect as a single token. Two latent issues that became visible once multi-byte separators are accepted are also fixed: - `possible_file_paths_in_word` indexed past a separator with `+ 1`, which is invalid for multi-byte UTF-8 punctuation (`:` is 3 bytes). The substring enumeration now tracks `run_starts` / `run_ends` separately and advances by `c.len_utf8()`. - The separator fragment emitted by `line_to_fragments` hardcoded `total_cell_width = 1`. Full-width punctuation visually occupies two cells, so this is now `UnicodeWidthChar::width(cell.c)`. ## Linked Issue Fixes warpdotdev#10245. - [ ] The linked issue is labeled `ready-to-spec` or `ready-to-implement`. ## Screenshots / Videos n/a — the existing unit tests in `app/src/util/link_detection_test.rs` cover the failing repros. ## Testing - New `test_possible_file_paths_in_word_cjk_punctuation` covers the issue's full table — fullwidth colon, fullwidth parentheses, CJK corner brackets, ideographic full stop, fullwidth comma — plus a negative case asserting that CJK letters in path names don't fragment the path. - The pre-existing `test_possible_file_paths_in_word_multibyte` still passes (CJK letters remain non-separators). - All 80 `terminal::model::grid::grid_handler::tests` and all 7 `util::link_detection::tests` pass under `cargo test --features local_fs --test-threads=1`. - Manually verified end-to-end: built `target/debug/warp-oss`, ran `printf '路径:/tmp/warp-repro.md\n路径: /tmp/warp-repro.md\n/tmp/warp-repro.md\n'`, and confirmed Cmd+Click works on all three lines (only the latter two worked before). ## Agent Mode - [ ] Warp Agent Mode - This PR was created via Warp's AI Agent Mode <!-- CHANGELOG-BUG-FIX: Fixed clickable file path detection failing when a path was directly preceded by CJK or full-width punctuation (e.g. `路径:/path/to/file`). -->
Description
Clickable file path detection in the terminal grid relied on a hardcoded ASCII separator set, so a path immediately preceded by CJK / full-width punctuation was not detected as clickable. This affects CJK prose and AI CLI output like:
This change replaces every
FILE_LINK_SEPARATORS.contains(&c)site with a newis_file_link_separator(c)helper that also accepts non-ASCII Unicode whitespace and general categoriesPo | Ps | Pe | Pi | Pf. Connectors (Pc), dashes (Pd), and CJK letters (Lo) are deliberately excluded so paths containing_,-, or CJK characters in their names (e.g./path/音楽/テスト.txt) continue to detect as a single token.Two latent issues that became visible once multi-byte separators are accepted are also fixed:
possible_file_paths_in_wordindexed past a separator with+ 1, which is invalid for multi-byte UTF-8 punctuation (:is 3 bytes). The substring enumeration now tracksrun_starts/run_endsseparately and advances byc.len_utf8().line_to_fragmentshardcodedtotal_cell_width = 1. Full-width punctuation visually occupies two cells, so this is nowUnicodeWidthChar::width(cell.c).Linked Issue
Fixes #10245.
ready-to-specorready-to-implement.Screenshots / Videos
n/a — the existing unit tests in
app/src/util/link_detection_test.rscover the failing repros.Testing
test_possible_file_paths_in_word_cjk_punctuationcovers the issue's full table — fullwidth colon, fullwidth parentheses, CJK corner brackets, ideographic full stop, fullwidth comma — plus a negative case asserting that CJK letters in path names don't fragment the path.test_possible_file_paths_in_word_multibytestill passes (CJK letters remain non-separators).terminal::model::grid::grid_handler::testsand all 7util::link_detection::testspass undercargo test --features local_fs --test-threads=1.target/debug/warp-oss, ranprintf '路径:/tmp/warp-repro.md\n路径: /tmp/warp-repro.md\n/tmp/warp-repro.md\n', and confirmed Cmd+Click works on all three lines (only the latter two worked before).Agent Mode