Add speaker embeddings support#3855
Conversation
ec35a1a to
4f45830
Compare
|
Can you include the unit tests? |
| TtsServable(const mediapipe::T2sCalculatorOptions& nodeOptions, const std::string& graphPath) { | ||
| auto fsModelsPath = std::filesystem::path(nodeOptions.models_path()); | ||
| TtsServable(const std::string& modelDir, const std::string& targetDevice, const google::protobuf::RepeatedPtrField<mediapipe::T2sCalculatorOptions_SpeakerEmbeddings>& graphVoices, const std::string& pluginConfig, const std::string& graphPath) { | ||
| auto fsModelsPath = std::filesystem::path(modelDir); |
There was a problem hiding this comment.
can you move this constructor to cpp? this way you could move read_speaker_embedding to cpp as well.
| input.seekg(0, std::ios::beg); | ||
|
|
||
| // Check size is multiple of float | ||
| OPENVINO_ASSERT(buffer_size % sizeof(float) == 0, "File size is not a multiple of float size."); |
There was a problem hiding this comment.
i think it might be the first time we use openvino_assert
why are we introducing yet another way of reporting errors?
for now we have:
- ovms::Status
- absl::Status
- std::variant<Obj, std::string(error)>
- and now this... :(
| for (auto voice : graphVoices) { | ||
| if (!std::filesystem::exists(voice.path())) | ||
| throw std::runtime_error{"Requested voice speaker embeddings file does not exist."}; | ||
| voices[voice.name()] = read_speaker_embedding(voice.path()); |
There was a problem hiding this comment.
do we catch exceptions thrown inside read_speaker_embedding?
0baece7 to
aa8b340
Compare
|
|
||
| Instead of generating speech with default model voice you can create speaker embeddings with [this script](https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/speech_generation/create_speaker_embedding.py) | ||
| ```bash | ||
| curl --output create_speaker_embedding.py "https://raw.githubusercontent.com/openvinotoolkit/openvino.genai/refs/heads/master/samples/python/speech_generation/create_speaker_embedding.py" |
There was a problem hiding this comment.
yet another link for manual change, it wouldnt even hit sed command which i use because its genai and master
do we want to keep this reference to master and risk of change and getting outdated demos? or do we want to hardcode it? or maybe do we want to keep it in our repo?
| if [ -f "$1/$TTS_MODEL/$TOKENIZER_FILE" ]; then | ||
| echo "Model file $1/$TTS_MODEL/$TOKENIZER_FILE exists. Skipping downloading models." | ||
| else | ||
| python3 demos/common/export_models/export_model.py text2speech --source_model "$TTS_MODEL" --weight-format int4 --model_repository_path $1 --vocoder microsoft/speecht5_hifigan |
There was a problem hiding this comment.
what about windows script?
| "input": "The quick brown fox jumped over the lazy dog." | ||
| } | ||
| )"; | ||
| ASSERT_EQ( |
There was a problem hiding this comment.
tests check only "OK" and no errors, but what about actual output?
There was a problem hiding this comment.
Pull request overview
This PR adds speaker embeddings support to the text-to-speech (TTS) functionality, allowing users to customize the voice used for speech generation by providing speaker embedding files.
Changes:
- Added configuration and validation for speaker embeddings in TTS calculator
- Refactored test infrastructure to reuse common test base class across multiple test files
- Added comprehensive test coverage for TTS functionality including speaker embeddings
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| windows_prepare_llm_models.bat | Added TTS model download configuration with vocoder |
| src/test/test_http_utils.hpp | Extracted V3HttpTest base class for reuse across test files |
| src/test/reranknode_test.cpp | Refactored to use extracted V3HttpTest base class |
| src/test/embeddingsnode_test.cpp | Refactored to use extracted V3HttpTest base class |
| src/test/audio/text2speech_test.cpp | Added comprehensive tests for TTS functionality including speaker embeddings |
| src/test/audio/graph.pbtxt | Added test graph configuration for TTS with speaker embeddings |
| src/test/audio/config.json | Added test configuration for TTS mediapipe graph |
| src/mediapipe_internal/mediapipegraphdefinition.cpp | Added error handling for TTS servable initialization |
| src/audio/text_to_speech/t2s_servable.hpp | Refactored to use constructor parameters instead of inline initialization |
| src/audio/text_to_speech/t2s_servable.cpp | Implemented constructor with speaker embeddings loading logic |
| src/audio/text_to_speech/t2s_calculator.proto | Added SpeakerEmbeddings message definition |
| src/audio/text_to_speech/t2s_calculator.cc | Added voice parameter handling and speaker embeddings usage |
| src/audio/text_to_speech/BUILD | Updated build dependencies for new implementation |
| src/BUILD | Added text2speech_test to build targets |
| prepare_llm_models.sh | Added TTS model download script for Linux |
| demos/audio/README.md | Added documentation for speaker embeddings feature |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| voice = voiceIt->value.GetString(); | ||
| } | ||
| if (voice.has_value()) { | ||
| if (pipe->voices.find(voice.value()) == pipe->voices.end()) |
There was a problem hiding this comment.
Are voices static at this point? Don't we need to lock it for the thread as well for the access?
There was a problem hiding this comment.
answer is yes, they are static, loaded once during pipeline initialization
| throw std::runtime_error("File size is not a multiple of float size."); | ||
| } | ||
| size_t num_floats = buffer_size / sizeof(float); | ||
| if (num_floats != 512) { |
There was a problem hiding this comment.
Why 512?
Can we have more context for this number for future reference?
Is it like embedding dim size or something?
| if (streamIt != payload.parsedJson->MemberEnd()) { | ||
| return absl::InvalidArgumentError("streaming is not supported"); | ||
| } | ||
| std::optional<std::string> voice; |
There was a problem hiding this comment.
| std::optional<std::string> voice; | |
| std::optional<std::string> voiceName; |
?
To distinguish from pipe->voices elements
| if (voiceIt != payload.parsedJson->MemberEnd() && voiceIt->value.IsString()) { | ||
| voice = voiceIt->value.GetString(); | ||
| } | ||
| if (voice.has_value()) { |
There was a problem hiding this comment.
Do we need separate condition here? Can we bring content from this block to the condition above?
if (voiceIt != payload.parsedJson->MemberEnd() && voiceIt->value.IsString()) I suppose it's the same.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
a93b865 to
2452659
Compare
🛠 Summary
JIRA/Issue if applicable.
Describe the changes.
🧪 Checklist
``