
feat: respect prepend_bos and add return_input_tokens flag #1439

Open
MdSadiqMd wants to merge 1 commit into TransformerLensOrg:main from MdSadiqMd:sadiq/record-prepend_bos

Conversation

@MdSadiqMd

Description

The Bridge tokenizer is loaded with add_bos_token=True, diverging from the default HF tokenizer behavior. This causes two silent footguns:

  1. generate(prepend_bos=...) was ignored: the parameter existed, but passing it only emitted a warning and generate() fell back to cfg.default_prepend_bos. Users therefore could not pass prepend_bos=False to avoid a double BOS when using chat templates that embed <|begin_of_text|>.
  2. There was no way to inspect which tokens the model actually received: when generation silently prepended an unexpected BOS, nothing in the generate() return value revealed it.
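
The double-BOS case in item 1 can be illustrated with a toy tokenizer. This is only a sketch: `toy_encode` and `count_leading_bos` are hypothetical helpers written for this example, not the Bridge or Hugging Face tokenizer, and the whitespace splitting is a stand-in for real tokenization.

```python
from typing import List

BOS = "<|begin_of_text|>"  # BOS marker named in the description above


def toy_encode(text: str, add_bos_token: bool = True) -> List[str]:
    """Hypothetical stand-in for a tokenizer; not the Bridge or HF implementation."""
    tokens: List[str] = []
    rest = text
    # Treat BOS markers already embedded in the text (e.g. by a chat template)
    # as single tokens.
    while rest.startswith(BOS):
        tokens.append(BOS)
        rest = rest[len(BOS):]
    tokens.extend(rest.split())
    # Mimic add_bos_token=True: unconditionally prepend one more BOS.
    if add_bos_token:
        tokens.insert(0, BOS)
    return tokens


def count_leading_bos(tokens: List[str]) -> int:
    """Count how many BOS tokens appear at the start of the sequence."""
    count = 0
    for token in tokens:
        if token != BOS:
            break
        count += 1
    return count


chat_text = BOS + "user: hello"  # pre-formatted text that already carries a BOS
doubled = toy_encode(chat_text, add_bos_token=True)
single = toy_encode(chat_text, add_bos_token=False)
print(count_leading_bos(doubled), count_leading_bos(single))  # prints: 2 1
```

With auto-prepending on, the pre-formatted text ends up with two leading BOS tokens, and nothing in the generated output alone would reveal that.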

This PR fixes both issues and documents the BOS contract so users know when and why to set prepend_bos=False.

Changes:

  • generate() now respects prepend_bos — the parameter is forwarded to to_tokens() instead of being discarded with a warning. generate(text, prepend_bos=False) correctly strips tokenizer-auto-prepended BOS, which is the required pattern when the input is pre-formatted chat-template text that already contains a BOS token.
  • return_input_tokens flag added — when True, generate() returns (output, input_tokens) where input_tokens is the token tensor that was actually fed to the model (after BOS handling). Compatible with return_cache=True (returns 3-tuple (output, cache, input_tokens)).
  • "Tokenization notes" docstring updated — explicitly documents that model.tokenizer has add_bos_token=True and differs from the stock HF tokenizer, that direct .encode() prepends BOS, and shows the recommended prepend_bos=False pattern for chat-template text.
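
As a rough sketch of the contract described in the first two bullets, the standalone helpers below model how an explicit prepend_bos could override the config default, how an auto-prepended BOS could be stripped, and how the return value grows with each flag. All three function names, the `None` default for prepend_bos, and the toy token ids are assumptions made for illustration; the actual implementation in this PR may be structured differently.

```python
from typing import Any, List, Optional, Sequence


def resolve_prepend_bos(prepend_bos: Optional[bool], default_prepend_bos: bool) -> bool:
    """Use the explicit argument when given, else fall back to the config default."""
    return default_prepend_bos if prepend_bos is None else prepend_bos


def strip_auto_bos(token_ids: Sequence[int], bos_id: int) -> List[int]:
    """Drop a single leading BOS, as added by a tokenizer with add_bos_token=True."""
    ids = list(token_ids)
    if ids and ids[0] == bos_id:
        return ids[1:]
    return ids


def pack_result(
    output: Any,
    input_tokens: Any,
    cache: Any = None,
    return_cache: bool = False,
    return_input_tokens: bool = False,
) -> Any:
    """Return output alone, or a tuple ordered (output, cache, input_tokens)."""
    parts = [output]
    if return_cache:
        parts.append(cache)
    if return_input_tokens:
        parts.append(input_tokens)
    return parts[0] if len(parts) == 1 else tuple(parts)


BOS_ID = 0                 # toy id, not a real vocabulary entry
auto_tokens = [BOS_ID, 42] # what a BOS-prepending tokenizer might return

use_bos = resolve_prepend_bos(False, default_prepend_bos=True)
input_tokens = auto_tokens if use_bos else strip_auto_bos(auto_tokens, BOS_ID)
result = pack_result("generated text", input_tokens, return_input_tokens=True)
print(result)  # prints: ('generated text', [42])
```

Returning the tokens alongside the output keeps the default return type unchanged for existing callers, while callers who opt in can check the first token directly.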

Fixes #1418

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Screenshots

N/A — code changes only. Verified manually with gpt2:

| prepend_bos | Token length | Starts with BOS |
| --- | --- | --- |
| True (default) | 2 tokens | Yes |
| False | 1 token | No |

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

@MdSadiqMd

Author

@jlarson4 PR is up, please review it



Development

Successfully merging this pull request may close these issues.

consequences of tokenizer divergence between TransformersLens and Huggingface Transformers
