# Android Voice Assistant Application
This is an example Android application featuring an integrated voice assistant. It uses Speech-to-Text (STT) and a Large Language Model (LLM) to process user voice commands: speech is converted to text, and an intelligent response is generated. The Android Text-to-Speech API then produces a spoken response. The application demonstrates a complete voice interaction pipeline for Android and by default uses the KleidiAI library for optimized performance on Arm® CPUs.
- Download and install the latest version of Android Studio
- Install the Android NDK. This project was tested with Android NDK r29.
- Python 3 must be installed. It is used to push resources and model files to the device.
This application is dependent on two modules which are downloaded during the build:
- STT - a speech-to-text module used to transform the user's audio prompt into a text representation
  - all required build configurations are located in the `stt` directory
  - whisper.cpp is used for this module
  - the git repository and revision of the downloaded module can be seen in `CMakeLists.txt`
  - specific build flags and build variants for this module can be seen in `build.gradle.kts`
- LLM - a large language model module used for the prompt-answering part of the pipeline
  - all required build configurations are located in the `llm` directory
  - llama.cpp is used for this module
  - the git repository and revision of the downloaded module can be seen in `CMakeLists.txt`
  - specific build flags and build variants for this module can be seen in `build.gradle.kts`
The modules are downloaded into the `stt/` and `llm/` directories and built as Android libraries.
NOTE: The modules require CMake version 3.27 or above.

Download the required CMake version, then create a `local.properties` file in the root directory of the repository and specify the CMake path as follows: `cmake.dir=<location-of-cmake-install>`
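For example, if a suitable CMake is installed under `/opt` (the path below is illustrative, not a required location), `local.properties` would contain:

```properties
# Point the build at a CMake >= 3.27 installation
cmake.dir=/opt/cmake-3.27.0
```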
The application pipeline consists of three parts, each described in more detail below.
Speech-to-Text is also known as Automatic Speech Recognition. This part of the pipeline focuses on converting spoken language into written text. Speech recognition is done in the following stages:
- the device microphone captures spoken language as an audio waveform
- the audio waveform is broken into small timeframes, and features representing the sound are extracted from each one
- a neural network predicts the most likely transcription of the audio based on grammar and context
- the final recognized text is generated and used for the next stage of the pipeline
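The framing stage above can be sketched as follows. This is a simplified illustration, not code from the app; the frame and hop sizes are typical speech-processing values, not necessarily the ones whisper.cpp uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the framing stage: the captured waveform is split into
// short, overlapping windows before features are extracted from each.
public class Framing {
    static List<float[]> frameWaveform(float[] samples, int frameSize, int hopSize) {
        List<float[]> frames = new ArrayList<>();
        // slide a window of frameSize samples forward by hopSize each step
        for (int start = 0; start + frameSize <= samples.length; start += hopSize) {
            frames.add(Arrays.copyOfRange(samples, start, start + frameSize));
        }
        return frames;
    }

    public static void main(String[] args) {
        float[] oneSecond = new float[16_000];                       // 1 s of audio at 16 kHz
        List<float[]> frames = frameWaveform(oneSecond, 400, 160);   // 25 ms window, 10 ms hop
        System.out.println(frames.size());                           // → 98
    }
}
```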
Large Language Models (LLMs) are designed for natural language understanding, and in this application one is used for question answering. The text transcription from the previous part of the pipeline is used as input to the neural model. During initialization, the application assigns a persona to the LLM to ensure a friendly and informative voice assistant experience. By default, the application uses an asynchronous flow for this part of the pipeline, meaning that parts of the response are collected as they become available. The application UI is updated with each new token, and the tokens are also used for the final stage of the pipeline.
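The asynchronous flow can be pictured as a callback that receives tokens one at a time and grows the visible response. The snippet below is a minimal sketch with a simulated token source; in the app, tokens arrive asynchronously from the LLM module.

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch of the asynchronous LLM flow: tokens are appended to the
// response as they arrive, and a callback fires on each partial result.
public class TokenStream {
    static String collect(List<String> tokens, Consumer<String> onPartial) {
        StringBuilder response = new StringBuilder();
        for (String token : tokens) {              // simulated: real tokens stream in over time
            response.append(token);
            onPartial.accept(response.toString()); // e.g. update the UI with the partial text
        }
        return response.toString();
    }

    public static void main(String[] args) {
        String full = collect(List.of("Hello", ",", " ", "world", "!"),
                partial -> System.out.println(partial));
        System.out.println("final: " + full);      // final: Hello, world!
    }
}
```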
The application includes support for Visual Question Answering (VQA), enabling users to provide an image as input and subsequently query the model with natural language questions grounded in that visual context. To initiate VQA, the image must be uploaded prior to starting the voice recording. Upon upload, the image undergoes encoding via the integrated vision encoder, producing a set of visual embeddings. These embeddings are retained in the chat context until the context is explicitly reset, allowing for multi-turn interaction and follow-up queries based on the same image.
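One way to picture the VQA state is a chat context that keeps the image embeddings until an explicit reset. The class and method names below are hypothetical, chosen only to illustrate the lifecycle described above; they are not taken from the app's sources.

```java
import java.util.Optional;

// Sketch of VQA state handling: embeddings from the vision encoder stay
// in the chat context until reset, so follow-up questions can reuse them.
public class ChatContext {
    private float[] imageEmbeddings;   // null until an image is uploaded and encoded

    void attachImage(float[] embeddings) { this.imageEmbeddings = embeddings; }

    Optional<float[]> embeddings() { return Optional.ofNullable(imageEmbeddings); }

    void reset() { imageEmbeddings = null; }   // explicit reset clears the image

    public static void main(String[] args) {
        ChatContext ctx = new ChatContext();
        ctx.attachImage(new float[]{0.1f, 0.2f});          // produced by the vision encoder
        System.out.println(ctx.embeddings().isPresent());  // true: follow-ups can use the image
        ctx.reset();
        System.out.println(ctx.embeddings().isPresent());  // false: context was cleared
    }
}
```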
Currently, this part of the application pipeline uses the Android Text-to-Speech API with some extra functionality in the application to ensure smooth and natural speech output. In synchronous mode, speech is only generated after the full response from the LLM is received. By default, the application operates in asynchronous mode, where speech synthesis starts as soon as a sufficient portion of the response (such as a half or full sentence) is available. Any additional response text is queued for processing by the Android Text-to-Speech engine.
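The asynchronous hand-off to the TTS engine can be sketched as cutting the streamed LLM output at sentence boundaries and queuing each complete sentence for synthesis. The boundary rule below is deliberately simple; the app's actual chunking logic may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: accumulate streamed tokens, emit complete sentences for TTS.
public class SentenceChunker {
    private final StringBuilder pending = new StringBuilder();
    final List<String> queued = new ArrayList<>();   // sentences handed to the TTS engine

    void onToken(String token) {
        pending.append(token);
        int cut;
        while ((cut = firstBoundary(pending)) >= 0) {
            queued.add(pending.substring(0, cut + 1).trim());  // queue the finished sentence
            pending.delete(0, cut + 1);                        // keep the unfinished remainder
        }
    }

    // naive sentence boundary: first '.', '!' or '?' in the buffer
    private static int firstBoundary(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        SentenceChunker chunker = new SentenceChunker();
        for (String t : new String[]{"Hi", " there", ". How", " can I help?"}) chunker.onToken(t);
        System.out.println(chunker.queued);   // [Hi there., How can I help?]
    }
}
```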
The default KleidiAI configuration is ABI-specific:
- arm64-v8a: KleidiAI is enabled by default.
- x86_64: KleidiAI is disabled by default.
To override these defaults, adjust the build flag:
- To disable KleidiAI, use `-PkleidiAI=false`.
- To enable KleidiAI on an ABI where it is disabled by default, use `-PkleidiAI=true`.
The application supports multiple LLM backend frameworks. You can choose the desired backend at build time using the `llmFramework` Gradle property.
Available options:
- `llama.cpp` (default)
- `onnxruntime-genai`
- `mnn`
- `mediapipe`
You can specify the framework when building the app from the command line:
`./gradlew assembleRelease -PllmFramework=onnxruntime-genai`
If no value is provided, the default is used.
NOTE: The default value is defined in `gradle.properties` and can be modified if a different framework is preferred by default.
NOTE: MediaPipe™ support is limited to certain models. Please set your Hugging Face access token in your `.netrc` file. Refer to the custom configuration of mediapipe for more details.
Details on supported LLM models can be found here.
This application supports custom configuration of the LLM via a JSON-formatted file located at `app/src/model_configuration_files/{LLM Framework Name}{Text or Vision}ConfigUser.json`.
Details on custom LLM configuration can be found in the links below:
Custom configuration of llama.cpp
Custom configuration of onnxruntime-genai
Custom configuration of mediapipe
In addition to the default settings, this application allows you to provide custom configuration parameters for the STT via a JSON-formatted file named `app/src/model_configuration_files/whisperConfig.json`. This file must contain the following mandatory keys:
- `printRealtime`: Show live partial results on the console.
- `printProgress`: Display a progress bar while decoding.
- `printTimeStamps`: Add timecodes in front of each segment.
- `printSpecial`: Print special tokens such as `<|nospeech|>`.
- `translate`: Translate everything into English instead of transcribing.
- `language`: ISO code of the spoken language, or `"auto"` for detection.
- `numThreads`: Number of CPU threads Whisper may use.
- `offsetMs`: Skip this many milliseconds at the start of the audio.
- `noContext`: Don't feed previous text back as context.
- `singleSegment`: Stop after the first (≈30 s) segment.
You only need to modify the values associated with these keys if you wish to customize the STT's behavior. Do not remove any of the keys, as they are mandatory for the configuration to work properly.
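A complete `whisperConfig.json` covering all mandatory keys might look like this (the values are illustrative, not the ones shipped with the app):

```json
{
  "printRealtime": false,
  "printProgress": false,
  "printTimeStamps": true,
  "printSpecial": false,
  "translate": false,
  "language": "auto",
  "numThreads": 4,
  "offsetMs": 0,
  "noContext": true,
  "singleSegment": false
}
```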
The STT and LLM modules automatically download the required neural network models during the build process. These models are then deployed to the device with a push-resources script and used for processing voice commands and generating responses.
This application has been tested on Google Pixel 8 Pro (Android 14) to ensure compatibility and performance.
The following ABIs (Application Binary Interfaces) are supported:
- arm64-v8a
- x86_64
The application has been built and tested using Android NDK r29. Other versions may work but have not been officially tested.
- Ensure that the correct CMake version is installed and configured in the `local.properties` file.
- Verify that dependencies are correctly downloaded by checking the logs in Android Studio.
- If facing memory issues, try increasing the heap size in `gradle.properties`.
- If the execution of the app is very slow, check the build variant used:
  - a debug build will not show good performance but is useful for debugging and tests
  - a release build should be used for best performance
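For the memory tip above, the Gradle JVM heap is raised via the standard `org.gradle.jvmargs` property in `gradle.properties`; the 4 GB value below is just an example:

```properties
# Give the Gradle JVM more heap if the build runs out of memory
org.gradle.jvmargs=-Xmx4g
```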
NOTE: The cancellation flow of the application is currently under testing. Further updates and improvements will follow.
- Arm® and KleidiAI™ are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere.
- MediaPipe™ and Android™ are trademarks of Google LLC.
This project is distributed under the software licenses in the `LICENSES` directory.
This project also includes a number of other projects; please see the license sections below for additional details: