Conversation
Force-pushed from a35c001 to aa6e11c (Compare)
Force-pushed from aa6e11c to 5c4f07b (Compare)
Force-pushed from 5c4f07b to d247a20 (Compare)
🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
shuningjin
left a comment
Thank you! I have some minor comments. As a follow-up, we could try placing the ngram-hash mapping globally for efficiency: either (1) at the model level, or (2) at the data-pipeline level, treating it as a special tokenizer.
engram_vocab_bases=config.engram_vocab_sizes,
max_ngram_size=config.engram_max_ngram_size,
engram_num_heads=config.engram_num_heads,
layer_ids=[layer_idx],
This should be initialized with the global layer_ids, as in the unit test. Otherwise the hashing behavior is different.
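A minimal sketch of what this suggests, assuming config.engram_layers holds the global list of Engram layer indices (that field name is an assumption, not confirmed by this diff):

```python
# Build the mapping with the full, global list of Engram layers so the hashing
# behavior matches the unit test, instead of passing only the current layer.
ngram_hash_mapping = NgramHashMapping(  # class introduced in this PR
    engram_vocab_bases=config.engram_vocab_sizes,
    max_ngram_size=config.engram_max_ngram_size,
    engram_num_heads=config.engram_num_heads,
    layer_ids=config.engram_layers,  # global layer ids, not [layer_idx]
)
```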
# List of vocab sizes for each n-gram order.
engram_vocab_sizes: []
(1) This config name can be confusing: the values are not the actual vocab sizes, but the starting points for the prime-size search.
For accuracy, I suggest renaming it to engram_vocab_bases, with the comment: # List of minimum head vocab sizes for each n-gram order in 2...engram_max_ngram_size.
For example, if max_ngram_size = 2, num_heads = 3, and engram_vocab_bases = [4]:
- ngram 2, head 0: find_next_unseen_prime > 4 - 1 = 3, so size = 5
- ngram 2, head 1: find_next_unseen_prime > 5, so size = 7
- ngram 2, head 2: find_next_unseen_prime > 7, so size = 11
(2) To reflect the hierarchy (engram -> 2...max ngram -> num_heads per ngram), the config order could be:
engram_max_ngram_size
engram_num_heads, engram_head_dim
engram_vocab_sizes
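To make the search in the example above concrete, here is a small self-contained sketch; find_next_unseen_prime below is an illustrative stand-in, and the actual Engram implementation may differ in its details:

```python
# Derive per-head vocab sizes by searching for successive unused primes.
def _is_prime(n: int) -> bool:
  if n < 2:
    return False
  return all(n % d for d in range(2, int(n**0.5) + 1))


def find_next_unseen_prime(start: int, seen: set) -> int:
  """Smallest prime strictly greater than `start` that has not been used yet."""
  p = start + 1
  while not _is_prime(p) or p in seen:
    p += 1
  return p


def head_vocab_sizes(engram_vocab_bases, num_heads):
  """Per-head vocab sizes for each n-gram order in 2..engram_max_ngram_size."""
  seen, sizes = set(), []
  for base in engram_vocab_bases:  # one base per n-gram order
    prev, per_head = base - 1, []
    for _ in range(num_heads):
      prev = find_next_unseen_prime(prev, seen)
      seen.add(prev)
      per_head.append(prev)
    sizes.append(per_head)
  return sizes


print(head_vocab_sizes([4], num_heads=3))  # [[5, 7, 11]]
```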
    and self.gradient_accumulation_steps > 1
):
  raise ValueError("FP8 quantization is not compatible with gradient accumulation.")
if self.engram_layers:
Thanks for adding the checks! It might be good to also check that len(engram_vocab_sizes) == engram_max_ngram_size - 1.
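A hedged sketch of that check, using the config field names that appear elsewhere in this PR (the exact placement inside the validation block is an assumption):

```python
# One vocab base is expected per n-gram order in 2..engram_max_ngram_size.
if self.engram_layers:
  expected = self.engram_max_ngram_size - 1
  if len(self.engram_vocab_sizes) != expected:
    raise ValueError(
        f"engram_vocab_sizes must have {expected} entries "
        f"(one per n-gram order in 2..engram_max_ngram_size), "
        f"got {len(self.engram_vocab_sizes)}."
    )
```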
@@ -181,10 +182,14 @@ def __init__(
    self.compressed_tokenizer = CompressedTokenizer(tokenizer)
    self.tokenizer_vocab_size = len(self.compressed_tokenizer)
    if pad_id is not None:
When pad_id is not None, we init self.pad_id, which is required by _get_ngram_hashes for padding.
When it is None, the intended behavior is unclear from the reference:
https://github.com/deepseek-ai/Engram/blob/fb7f84a21f91223715394a33a1dc24bbfb7f788e/engram_demo_v1.py#L211
Maybe raise an error and add a TODO comment?
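For illustration, one way the None branch could be made explicit, following the suggestion above (whether to raise or fall back is still an open question):

```python
if pad_id is not None:
  self.pad_id = pad_id
else:
  # TODO: confirm the intended behavior when the tokenizer provides no pad id;
  # the reference implementation is unclear on this case.
  raise ValueError(
      "pad_id must be provided: _get_ngram_hashes relies on it for padding."
  )
```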
# TODO(ranran): Refactor NgramHashMapping to initialize once globally or at the model level.
# Moving this to decoders.py currently causes JAX initialization errors.
Alternatively, can we initialize it in the data pipeline, treating it as a special tokenizer that operates with np on CPU, similar to the existing hf_tokenizer?
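A rough, hypothetical sketch of that idea; the class name, example keys, and the hash call below are placeholders, not part of this PR's actual API:

```python
import numpy as np


class NgramHashOp:
  """Data-pipeline step that precomputes Engram n-gram hash ids on CPU."""

  def __init__(self, hash_mapping):
    # A single, globally constructed mapping (e.g. the NgramHashMapping from this PR).
    self.hash_mapping = hash_mapping

  def __call__(self, example: dict) -> dict:
    tokens = np.asarray(example["inputs"])
    # Placeholder call for whatever method the mapping exposes for hashing token ids;
    # running it here keeps the jitted model free of data-dependent Python control flow.
    example["engram_hash_ids"] = self.hash_mapping.hash(tokens)
    return example
```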
Yeah, I tried both in this draft PR:
- putting it in the data pipeline: https://screenshot.googleplex.com/5dhowKp7EjzMuud (not working)
- putting it at the model level: https://screenshot.googleplex.com/5FW7nxAXu9fFDCA (not working)
I ran into different issues in both cases. If bandwidth allows, we could dive deeper.
Description
Integrate Engram feature into a custom model
- Adds Engram config to base.yml and type.py, and integrates with the deepseek-custom model
- Uses both jnp and np, as some operations run on CPU to avoid ConcretizationTypeError and other issues

Tests
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.