Detect file paths preceded by Unicode/CJK punctuation #10250

Merged
vorporeal merged 1 commit into warpdotdev:master from SagarSDagdu:fix/path-detection-cjk-fullwidth-punctuation
May 8, 2026
Conversation

@SagarSDagdu
Contributor

Description

Clickable file path detection in the terminal grid relied on a hardcoded ASCII separator set, so a path immediately preceded by CJK / full-width punctuation was not detected as clickable. This affects CJK prose and AI CLI output like:

路径:/Users/me/project/plan.md   ← was NOT clickable (fullwidth : touching /)
路径: /Users/me/project/plan.md   ← clickable (ASCII : + space)
/Users/me/project/plan.md         ← clickable (bare)

This change replaces every FILE_LINK_SEPARATORS.contains(&c) site with a new is_file_link_separator(c) helper that also accepts non-ASCII Unicode whitespace and general categories Po | Ps | Pe | Pi | Pf. Connectors (Pc), dashes (Pd), and CJK letters (Lo) are deliberately excluded so paths containing _, -, or CJK characters in their names (e.g. /path/音楽/テスト.txt) continue to detect as a single token.
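
As a rough illustration of the rule described above, here is a minimal sketch of such a helper. It is not the code from this PR: `ASCII_SEPARATORS` and `is_sample_boundary_punctuation` are hypothetical names, the ASCII set is a made-up stand-in for `FILE_LINK_SEPARATORS`, and the punctuation list is a small hand-picked sample rather than a real general-category lookup (the Rust standard library does not expose general categories).

```rust
/// Hypothetical stand-in for the existing ASCII separator set.
const ASCII_SEPARATORS: &[char] = &[' ', '\t', ':', ',', '(', ')', '[', ']', '"', '\''];

/// Hand-picked sample of non-ASCII punctuation from categories Po/Ps/Pe/Pi/Pf.
/// An assumption for illustration: a real implementation would consult a
/// Unicode general-category table instead of listing characters.
fn is_sample_boundary_punctuation(c: char) -> bool {
    matches!(
        c,
        '\u{FF1A}' // fullwidth colon (Po)
            | '\u{FF0C}' // fullwidth comma (Po)
            | '\u{3002}' // ideographic full stop (Po)
            | '\u{FF08}' // fullwidth left parenthesis (Ps)
            | '\u{FF09}' // fullwidth right parenthesis (Pe)
            | '\u{300C}' // left corner bracket (Ps)
            | '\u{300D}' // right corner bracket (Pe)
            | '\u{201C}' // left double quotation mark (Pi)
            | '\u{201D}' // right double quotation mark (Pf)
    )
}

/// Sketch of the separator gate: ASCII characters use the fixed set, while
/// non-ASCII characters count if they are whitespace or boundary punctuation.
/// Connectors (`_`), dashes (`-`), and CJK letters fall through to `false`.
fn is_file_link_separator(c: char) -> bool {
    if c.is_ascii() {
        return ASCII_SEPARATORS.contains(&c);
    }
    c.is_whitespace() || is_sample_boundary_punctuation(c)
}
```

Under these assumptions, the fullwidth colon (U+FF1A) and the ideographic space (U+3000) are separators, while `_`, `-`, `/`, and a CJK letter such as 音 are not, so `/path/音楽/テスト.txt` stays a single token.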

Two latent issues that became visible once multi-byte separators are accepted are also fixed:

  • possible_file_paths_in_word indexed past a separator with + 1, which is invalid for multi-byte UTF-8 punctuation (: is 3 bytes). The substring enumeration now tracks run_starts / run_ends separately and advances by c.len_utf8().
  • The separator fragment emitted by line_to_fragments hardcoded total_cell_width = 1. Full-width punctuation visually occupies two cells, so this is now UnicodeWidthChar::width(cell.c).
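
To make the first point concrete, the sketch below shows why stepping past a separator by one byte is unsafe and how `char::len_utf8` avoids the problem. The function name `suffixes_after_separators` is hypothetical, and the sketch does not reproduce the `run_starts` / `run_ends` bookkeeping from the actual change.

```rust
/// Returns the part of `word` that follows each separator character.
/// Hypothetical helper for illustration only.
fn suffixes_after_separators(word: &str, is_separator: impl Fn(char) -> bool) -> Vec<&str> {
    word.char_indices()
        .filter(|&(_, c)| is_separator(c))
        // `i` is the byte offset where the separator starts. Slicing at `i + 1`
        // would panic when the separator is longer than one byte, because the
        // offset falls inside the character. `len_utf8` skips the whole char.
        .map(|(i, c)| &word[i + c.len_utf8()..])
        .collect()
}
```

For the word `路径:/tmp/a.md` (with a fullwidth colon), the two CJK letters occupy bytes 0..6 and the colon occupies bytes 6..9, so byte 7 is not a character boundary and `&word[7..]` would panic, whereas `&word[9..]` yields `/tmp/a.md`.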

Linked Issue

Fixes #10245.

  • The linked issue is labeled ready-to-spec or ready-to-implement.

Screenshots / Videos

n/a — the existing unit tests in app/src/util/link_detection_test.rs cover the failing repros.

Testing

  • New test_possible_file_paths_in_word_cjk_punctuation covers the issue's full table — fullwidth colon, fullwidth parentheses, CJK corner brackets, ideographic full stop, fullwidth comma — plus a negative case asserting that CJK letters in path names don't fragment the path.
  • The pre-existing test_possible_file_paths_in_word_multibyte still passes (CJK letters remain non-separators).
  • All 80 terminal::model::grid::grid_handler::tests and all 7 util::link_detection::tests pass under cargo test --features local_fs --test-threads=1.
  • Manually verified end-to-end: built target/debug/warp-oss, ran printf '路径:/tmp/warp-repro.md\n路径: /tmp/warp-repro.md\n/tmp/warp-repro.md\n', and confirmed Cmd+Click works on all three lines (only the latter two worked before).

Agent Mode

  • Warp Agent Mode - This PR was created via Warp's AI Agent Mode

@cla-bot cla-bot Bot added the cla-signed label May 6, 2026
@oz-for-oss
Contributor

oz-for-oss Bot commented May 6, 2026

@SagarSDagdu

I'm starting a first review of this pull request.

You can view the conversation on Warp.

I reviewed this pull request and requested human review from: @warpdotdev/oss-maintainers.

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

@github-actions github-actions Bot added the external-contributor Indicates that a PR has been opened by someone outside the Warp team. label May 6, 2026
Contributor

@oz-for-oss oz-for-oss Bot left a comment

Overview

This PR updates file-path boundary detection to recognize non-ASCII whitespace and selected Unicode punctuation, fixes candidate substring indexing for multi-byte separators, and uses Unicode display width for separator fragments.

Concerns

  • No blocking correctness, security, or performance concerns found in the annotated diff.

Verdict

Found: 0 critical, 0 important, 0 suggestions

Approve


@oz-for-oss oz-for-oss Bot requested review from a team and vorporeal and removed request for a team May 6, 2026 10:53
@vorporeal
Contributor

hey @SagarSDagdu - #10004 (merged earlier today) also expanded the set of file link separators and added general support for non-ASCII ones.

mind rebasing on top of those changes? i think it'll mostly simplify your PR down to is_file_link_separator() (which we'll want to also handle the unicode line drawing characters covered by that PR).

@SagarSDagdu SagarSDagdu force-pushed the fix/path-detection-cjk-fullwidth-punctuation branch from 570ca63 to 81f72dd Compare May 7, 2026 17:02
@SagarSDagdu
Contributor Author

SagarSDagdu commented May 7, 2026

Rebased on master to pick up #10004's multi-byte indexing refactor that already covered the byte-range fix I'd duplicated, so this PR now reduces to:

  • New is_file_link_separator(c) helper that wraps the existing FILE_LINK_SEPARATORS set (ASCII + box-drawing) and additionally accepts non-ASCII Unicode whitespace plus general categories Po | Ps | Pe | Pi | Pf. Pc/Pd/Lo are deliberately excluded so _, -, and CJK letters don't fragment paths.
  • All 5 FILE_LINK_SEPARATORS.contains(&c) sites (4 in grid_handler, 1 in link_detection) now go through the helper, so the box-drawing chars from #10004 (fix: detect tree output filenames as file links) and the CJK punctuation from this PR share the same gate.
  • line_to_fragments now uses UnicodeWidthChar::width for the separator fragment (full-width punctuation occupies two cells).
  • New test_possible_file_paths_in_word_cjk_punctuation covers the issue's table plus a negative case for CJK letters in path names. The tree-output tests added in #10004 still pass.
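
The cell-width point can be illustrated with a small sketch. `approx_cell_width` and `start_column_after` are hypothetical helpers that only approximate a few wide ranges; they stand in for `UnicodeWidthChar::width` from the `unicode-width` crate (which returns an `Option<usize>`) and are not a replacement for it.

```rust
/// Approximate terminal cell width of a character.
/// Assumption for illustration: only three wide ranges are recognized.
fn approx_cell_width(c: char) -> usize {
    match c as u32 {
        0x3000..=0x303E => 2, // CJK symbols and punctuation
        0x4E00..=0x9FFF => 2, // CJK unified ideographs
        0xFF01..=0xFF60 => 2, // fullwidth ASCII variants
        _ => 1,
    }
}

/// Column at which the text following `prefix` starts, obtained by summing
/// the cell width of every character in `prefix`.
fn start_column_after(prefix: &str) -> usize {
    prefix.chars().map(approx_cell_width).sum()
}
```

With these assumptions, the prefix `路径:` (two CJK letters plus a fullwidth colon) spans 6 cells. Treating the separator as 1 cell wide would place the start of the path at column 5 instead, one cell to the left of where it is drawn.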

Please review @vorporeal.

Contributor

@vorporeal vorporeal left a comment

looks great; thank you!

@vorporeal vorporeal enabled auto-merge (squash) May 7, 2026 18:05
@SagarSDagdu
Contributor Author

SagarSDagdu commented May 8, 2026

@vorporeal I think you'll have to kick off the workflows manually on this PR to merge it, as I had to resolve some conflicts.

Clickable file path detection didn't recognize paths preceded by CJK /
full-width punctuation (`:`, `(`, `「`, etc.), which is common in CJK prose and
AI CLI output like `路径:/path/to/file`.

Replaces the four `FILE_LINK_SEPARATORS.contains(&c)` sites in
`grid_handler` and the one in `link_detection` with a new
`is_file_link_separator` helper. The helper accepts the existing
ASCII-and-box-drawing set, plus any non-ASCII Unicode whitespace and
general categories `Po | Ps | Pe | Pi | Pf`. Connectors (`Pc`), dashes
(`Pd`), and CJK letters (`Lo`) are deliberately excluded so paths
containing `_`, `-`, or CJK characters in their names still detect as a
single token.

Also fixes a latent issue in `line_to_fragments`: the separator fragment
hardcoded `total_cell_width = 1`, but full-width separators visually
occupy two cells; uses `UnicodeWidthChar::width` instead.

Fixes warpdotdev#10245.
auto-merge was automatically disabled May 8, 2026 04:48

Head branch was pushed to by a user without write access

@SagarSDagdu SagarSDagdu force-pushed the fix/path-detection-cjk-fullwidth-punctuation branch 2 times, most recently from 41ef721 to d50a742 Compare May 8, 2026 06:05
@SagarSDagdu SagarSDagdu requested a review from vorporeal May 8, 2026 11:45
@vorporeal vorporeal enabled auto-merge (squash) May 8, 2026 15:55
@vorporeal vorporeal merged commit 68f6062 into warpdotdev:master May 8, 2026
28 of 36 checks passed
trungtai1805 pushed a commit to trungtai1805/warp that referenced this pull request May 9, 2026
<!--
CHANGELOG-BUG-FIX: Fixed clickable file path detection failing when a
path was directly preceded by CJK or full-width punctuation (e.g.
`路径:/path/to/file`).
-->
Labels

cla-signed external-contributor Indicates that a PR has been opened by someone outside the Warp team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Path detection: CJK / full-width punctuation not treated as word boundary, paths become un-clickable

2 participants