Android Voice Assistant Application


Introduction

This is an example Android application featuring an integrated voice assistant. It uses Speech-to-Text (STT) and a Large Language Model (LLM) to process user voice commands, convert them into text, and generate intelligent responses. The Android Text-to-Speech API then produces a spoken response. The application demonstrates a complete voice interaction pipeline for Android and by default uses the KleidiAI™ library for optimized performance on Arm® CPUs.

Prerequisites

  1. Download and install the latest version of Android Studio
  2. Install the Android NDK. This project was tested with Android NDK r29.
  3. Python 3 must be installed. It is used to push resources and model files to the device.

Dependencies

This application depends on two modules that are downloaded during the build:

  • STT - the speech-to-text module, which transforms the user's audio prompt into a text representation
    • all required build configurations are located in the stt directory
    • whisper.cpp is used for this module
    • the Git repository and revision of the downloaded module are specified in CMakeLists.txt
    • specific build flags and build variants for this module are defined in build.gradle.kts
  • LLM - the large language model module, which handles the prompt-answering part of the pipeline
    • all required build configurations are located in the llm directory
    • llama.cpp is used for this module
    • the Git repository and revision of the downloaded module are specified in CMakeLists.txt
    • specific build flags and build variants for this module are defined in build.gradle.kts

The modules are downloaded into the stt/ and llm/ directories and built as Android libraries.

NOTE: The modules require cmake version 3.27 or above.

Download the required CMake version. Create a local.properties file in the root directory of the repository and specify the CMake path as follows:

cmake.dir=<location-of-cmake-install>

Application pipeline

The application pipeline consists of three main parts, plus an optional visual stage, each described in more detail below.

Speech to Text Library

Speech-to-Text is also known as Automatic Speech Recognition. This part of the pipeline focuses on converting spoken language into written text. Speech recognition is done in the following stages:

  • the device microphone captures spoken language as an audio waveform
  • the waveform is split into short timeframes, and features representing the sound are extracted from each
  • a neural network predicts the most likely transcription of the audio based on grammar and context
  • the final recognized text is generated and passed to the next stage of the pipeline
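The framing stage above can be sketched as follows. This is a simplified illustration, not the whisper.cpp implementation; the frame and hop sizes are arbitrary example values:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Splits a captured waveform into fixed-size, overlapping frames.
// Real pipelines (e.g. whisper.cpp) then compute mel-spectrogram
// features from each frame before the neural network runs.
public class WaveformFramer {
    public static List<float[]> frame(float[] samples, int frameSize, int hopSize) {
        List<float[]> frames = new ArrayList<>();
        for (int start = 0; start + frameSize <= samples.length; start += hopSize) {
            frames.add(Arrays.copyOfRange(samples, start, start + frameSize));
        }
        return frames;
    }
}
```

Overlapping frames (hop smaller than frame size) are typical in speech feature extraction, so that sound events at frame boundaries are not lost.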

Large Language Models Library

Large Language Models (LLMs) are designed for natural language understanding, and in this application they are used for question answering. The text transcription from the previous part of the pipeline is used as input to the neural model. During initialization, the application assigns a persona to the LLM to ensure a friendly and informative voice assistant experience. By default, the application uses an asynchronous flow for this part of the pipeline, meaning that parts of the response are collected as they become available. The application UI is updated with each new token, and the tokens are also used for the final stage of the pipeline.
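The token-by-token flow can be sketched like this (a hypothetical simplification; the actual module API differs, and in the real application tokens arrive asynchronously from the llama.cpp module rather than from a list):

```java
import java.util.List;
import java.util.function.Consumer;

// Collects LLM response tokens as they arrive, invoking a callback
// with the accumulated text after every token so the UI can show
// the partial response immediately.
public class TokenCollector {
    public static String collect(List<String> tokens, Consumer<String> onUpdate) {
        StringBuilder response = new StringBuilder();
        for (String token : tokens) {
            response.append(token);
            onUpdate.accept(response.toString()); // refresh the UI with partial text
        }
        return response.toString();
    }
}
```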

Visual Question Answering (Optional)

The application includes support for Visual Question Answering (VQA), enabling users to provide an image as input and subsequently query the model with natural language questions grounded in that visual context. To initiate VQA, the image must be uploaded prior to starting the voice recording. Upon upload, the image undergoes encoding via the integrated vision encoder, producing a set of visual embeddings. These embeddings are retained in the chat context until the context is explicitly reset, allowing for multi-turn interaction and follow-up queries based on the same image.

Text to Speech Component

Currently, this part of the application pipeline uses the Android Text-to-Speech API, with some extra functionality in the application to ensure smooth and natural speech output. In synchronous mode, speech is only generated after the full response from the LLM has been received. By default, the application operates in asynchronous mode, where speech synthesis starts as soon as a sufficient portion of the response (such as a half or full sentence) is available. Any additional responses are queued for processing by the Android Text-to-Speech engine.
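The sentence-level buffering described above can be sketched as follows. This is a hypothetical helper, not the application's actual code; the punctuation-based sentence splitting is a deliberate simplification:

```java
import java.util.function.Consumer;

// Buffers streamed LLM tokens and emits complete sentences as soon as
// they are available, so speech synthesis can start before the full
// response has arrived. On Android, each emitted sentence would be
// handed to TextToSpeech.speak(..., TextToSpeech.QUEUE_ADD, ...).
public class SentenceChunker {
    private final StringBuilder buffer = new StringBuilder();
    private final Consumer<String> onSentence;

    public SentenceChunker(Consumer<String> onSentence) {
        this.onSentence = onSentence;
    }

    public void addToken(String token) {
        buffer.append(token);
        int end;
        while ((end = firstSentenceEnd()) >= 0) {
            onSentence.accept(buffer.substring(0, end + 1).trim());
            buffer.delete(0, end + 1);
        }
    }

    public void flush() { // speak any trailing text once the stream ends
        String rest = buffer.toString().trim();
        if (!rest.isEmpty()) onSentence.accept(rest);
        buffer.setLength(0);
    }

    private int firstSentenceEnd() {
        for (int i = 0; i < buffer.length(); i++) {
            char c = buffer.charAt(i);
            if (c == '.' || c == '!' || c == '?') return i;
        }
        return -1;
    }
}
```

Because the Text-to-Speech engine queues utterances, sentences can be submitted as soon as they complete while earlier ones are still being spoken.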

KleidiAI Configuration

The default KleidiAI configuration is ABI-specific:

  • arm64-v8a: KleidiAI is enabled by default.
  • x86_64: KleidiAI is disabled by default.

To override these defaults, simply adjust the build flag:

  • To disable KleidiAI, use -PkleidiAI=false.
  • To enable KleidiAI on an ABI where it is disabled by default, use -PkleidiAI=true.

LLM Framework

The application supports multiple LLM backend frameworks. You can choose the desired backend at build time using the llmFramework Gradle property.

Available options:

  • llama.cpp (default)
  • onnxruntime-genai
  • mnn
  • mediapipe

You can specify the framework when building the app from the command line:

./gradlew assembleRelease -PllmFramework=onnxruntime-genai

If no value is provided, the default is used.

NOTE: The default value is defined in gradle properties and can be modified if a different framework is preferred by default.

NOTE: MediaPipe™ support covers only a limited set of models. Please set your Hugging Face access token in your .netrc file. Refer to the custom configuration of mediapipe for more details.

Details on supported LLM models can be found here.

Custom LLM Configuration

This application supports custom configuration of the LLM via a JSON-formatted file located here: app/src/model_configuration_files/{LLM Framework Name}{Text or Vision}ConfigUser.json.

Details on custom LLM configuration can be found in the links below:

Custom configuration of llama.cpp

Custom configuration of onnxruntime-genai

Custom configuration of mnn

Custom configuration of mediapipe

Custom STT Configuration

In addition to the default settings, this application allows you to provide custom configuration parameters for the STT via a JSON-formatted file named app/src/model_configuration_files/whisperConfig.json. This file must contain the following mandatory keys:

  • printRealtime: Show live partial results on the console.
  • printProgress: Display a progress bar while decoding.
  • printTimeStamps: Add timecodes in front of each segment.
  • printSpecial: Print special tokens such as <|nospeech|>.
  • translate: Translate everything into English instead of transcribing.
  • language: ISO code of the spoken language, or "auto" for detection.
  • numThreads: Number of CPU threads Whisper may use.
  • offsetMs: Skip this many milliseconds at the start of the audio.
  • noContext: Don’t feed previous text back as context.
  • singleSegment: Stop after the first (≈30 s) segment.

You only need to modify the values associated with these keys if you wish to customize the STT's behavior. Do not remove any of the keys, as they are mandatory for the configuration to work properly.
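As an illustration, a whisperConfig.json with all of the mandatory keys might look like the following (the values shown are plausible examples, not the shipped defaults):

```json
{
  "printRealtime": false,
  "printProgress": false,
  "printTimeStamps": true,
  "printSpecial": false,
  "translate": false,
  "language": "auto",
  "numThreads": 4,
  "offsetMs": 0,
  "noContext": true,
  "singleSegment": false
}
```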

Resources

The STT and LLM modules automatically download the required neural network models during the build process. These models are then pushed to the device with a push-resources script, where they are used to process voice commands and generate responses.

Tested Devices

This application has been tested on Google Pixel 8 Pro (Android 14) to ensure compatibility and performance.

Supported ABIs

The following ABIs (Application Binary Interfaces) are supported:

  • arm64-v8a
  • x86_64

Supported NDK Versions

The application has been built and tested with Android NDK r29. Other versions may work but have not been officially tested.

Troubleshooting

  • Ensure that the correct CMake version is installed and configured in local.properties file.
  • Verify that dependencies are correctly downloaded by checking the logs in Android Studio.
  • If facing memory issues, try increasing heap size in gradle.properties.
  • If the execution of the app is very slow, check the build variant used:
    • a debug build will not show good performance, but is useful for debugging and tests
    • a release build should be used for best performance
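If memory issues occur during the build, the JVM heap available to Gradle can be raised in gradle.properties; the 4 GB shown here is an arbitrary example value, not a recommendation:

```properties
org.gradle.jvmargs=-Xmx4g
```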

Known issues

NOTE: The cancellation flow of the application is currently under testing. Further updates and improvements will follow.

Trademarks

  • Arm® and KleidiAI™ are registered trademarks or trademarks of Arm® Limited (or its subsidiaries) in the US and/or elsewhere.
  • MediaPipe™ and Android™ are trademarks of Google LLC.

License

This project is distributed under the software licenses in the LICENSES directory.

This project also includes a number of other projects, please see the license sections below for additional details:

Arm-Examples / LLM-Runner

Arm-Examples / STT-Runner

About

CMake-based real-time voice assistant example demonstrating Arm® KleidiAI™ acceleration for LLM and speech pipelines.
