Add half-float (FP16) storage support for vectors #15549
Pulkitg64 wants to merge 20 commits into apache:main from …
Conversation
@Pulkitg64 the latency is the main concern IMO. We must copy the vectors onto heap (we know this is expensive) and transform the bytes to floats. I wonder if all the cost is spent just decoding the vector? What does a flame graph tell you? Also, could you indicate your JVM, etc.? See this interesting JEP update on the ever-incubating Vector API:
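(As a hedged illustration of the decode cost being discussed, not code from this PR: on the scalar path, each fp16 dot product pays a per-element short-to-float conversion, e.g. via the JDK's built-in `Float.float16ToFloat`, available since Java 20.)

```java
// Illustrative only: the per-element decode cost under discussion.
// Each stored fp16 value (held as a short) must be widened to float
// before the multiply-add can happen.
static float dotProductFp16(short[] a, short[] b) {
  float sum = 0f;
  for (int i = 0; i < a.length; i++) {
    sum += Float.float16ToFloat(a[i]) * Float.float16ToFloat(b[i]);
  }
  return sum;
}
```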
@Pulkitg64 also, thank you for doing an initial pass and benchmarking, it's important data :D. I wonder if we want a true element type vs. a new format? The element type has indeed expanded its various uses, but for many of them, Float16 isn't that much different than float (e.g. you still likely query & index with float). This is just an idea. I am not 100% sold either way. Looking for discussion.
You need https://bugs.openjdk.org/browse/JDK-8370691 for this one to be performant.
Just look at the numbers on the PR. They benchmark the cosine and the dot product. Maybe try it out with the branch from that OpenJDK PR. Code in …
Thanks @benwtrent, @rmuir for such quick responses. Let me try to gather some more data to confirm whether the conversion is driving the regression.
Trying now.
stop converting. use the native fp16 type (and vector type), otherwise code will be slow |
I don't have any good news right now, but since it has been more than a week, here is some progress: I tried using the JDK PR for float16 computation, as suggested by @rmuir. For this I had to check out the JDK, pull the PR locally, build it, and use that build to compile the Lucene code. After doing all of the above, I ran the benchmark below.
Profiler output for float16:
Next Steps:
I looked at your commented-out code here and it doesn't seem to use the Float16Vector class; instead it is doing a bunch of conversions and scalar operations.
Hi @rmuir,
Actually, for the defaultUtilSupport (not using Panama), I tried three different approaches (that's why the other two are commented out), but approach 1 gave the best performance in my benchmarks.
The Float16Vector class is used in the PanamaVectorUtilSupport class, for which we are seeing very poor performance, as explained in my comment above. (Sorry for the confusion, the PR size makes it difficult to navigate.) But please let me know if you meant something else in your comment.
as i said, you aren't using the vector classes correctly |
Hi,
Sorry, earlier I pasted the wrong profiler output above for the float32 implementation. This is the correct profiler output for float32: … and this is for float16: …
I think I found the problem. I was running these benchmarks on m5.12xlarge machines, and that instance type doesn't support float16 intrinsic operations. So I changed my instance to m7g.8xlarge machines, and here are the results: I am seeing much better performance with float16 encoding now. The latency with float16 is still 50% higher than float32. Also, I haven't implemented bulk scoring yet, so maybe that will recover some latency. The indexing rate improved by 10% (this may be because smaller vectors are faster to fetch).
Next Steps: Understand the flame chart and try to further improve the float16 encoding benchmark runs.
After adding bulk-scoring support for float16 dot-product score calculations, I am seeing better metrics than float32: an improvement of about 10% in latency as well as indexing rate. Results:
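(A minimal sketch of the bulk-scoring idea, with a hypothetical method shape rather than the PR's actual API: decode the fp16 query once and score several stored vectors per call, amortizing per-invocation overhead.)

```java
// Hypothetical bulk scorer shape: the query is decoded once, then dot
// products are computed against many stored fp16 vectors in one call.
static void bulkDotProduct(short[] fp16Query, short[][] fp16Docs, float[] scores) {
  float[] query = new float[fp16Query.length];
  for (int i = 0; i < fp16Query.length; i++) {
    query[i] = Float.float16ToFloat(fp16Query[i]);
  }
  for (int d = 0; d < fp16Docs.length; d++) {
    short[] doc = fp16Docs[d];
    float sum = 0f;
    for (int i = 0; i < query.length; i++) {
      sum += query[i] * Float.float16ToFloat(doc[i]);
    }
    scores[d] = sum;
  }
}
```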
These are the results with different quantizations: with quantization enabled, we are seeing performance similar to float32, which is a bit surprising to me because I thought float16 would be slower; in my code, I have to inflate the fp16 vector to fp32 while quantizing it. I will check and confirm that there is no mistake in the benchmark runs.
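(To make the extra quantization work concrete, a sketch under the assumption that the existing scalar quantizer only accepts fp32 input; the helper name is made up.)

```java
// Hypothetical helper: an fp16 vector must be inflated to fp32 before
// it can be handed to the existing fp32-based scalar quantizer, which
// is the extra per-vector cost mentioned above.
static float[] inflateForQuantization(short[] fp16Vector) {
  float[] fp32 = new float[fp16Vector.length];
  for (int i = 0; i < fp16Vector.length; i++) {
    fp32[i] = Float.float16ToFloat(fp16Vector[i]);
  }
  return fp32;
}
```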
Since we are seeing good performance with FP16 now, I wanted to know what the path forward should be. The JDK PR for adding Float16Vector operations has not been merged yet, and even if it gets merged, we likely cannot use it until the JDK 27 release (unless we have early access). So, until we have that support, should we add a native implementation of FP16 scoring? This would be easier after #15508 (which adds native support in Lucene) gets merged. Once JDK 27 is released, we can switch to the Java implementation for scoring.
…rValues for fallback support
I have removed the Panama implementation for now, but we can add it back later once we have access to Float16Vector operations in the JDK. Below are the benchmark numbers with the default implementation. Summary: used 100k docs across all runs, with force-merge. For the no-quantization case we are seeing a high latency regression, more than 100% (which is expected), but for the quantization case we are seeing comparable latency. On the indexing side, we are seeing a regression in indexing time across all runs (whether or not quantization is enabled). This is also expected, because for quantization runs we have to do an extra conversion from fp16 to fp32 to quantize the vectors.
Next Steps: If we are okay with the above performance numbers, should we go ahead with this PR, which adds float16 VectorEncoding support without a Panama implementation, or should we park this PR and wait for the JDK 27 release?
Thanks @Pulkitg64, this is a very exciting change. It's frustrating to receive fp16 vectors (at Amazon's customer-facing product search team) for indexing and have to fluff them up to fp32 entirely, before then quantizing them down to more sane (1, 2, 4, 8) bits per dim. And because these fluffy vectors take 2X the storage they really should have, we build ways to drop them from read-only replica indices. It would be so much better if Lucene could handle incoming vectors entirely in their original fp16 form (this PR). So, it's JDK 27 which will introduce Panama access to fp16 SIMD capabilities? And modern CPUs generally have good support for fp16? And today (pre-JDK 27) this PR must emulate (simple Java code) the fp16 operations? And that's why it's slower? If we enabled users to swap in their own … I haven't looked closely at the code changes yet ... just trying to get a grip on the high-level situation. Thanks @Pulkitg64.
It seems as if this PR has merit as-is, without any special SIMD support either from the JDK or via a custom vector support provider implementation, because, at least in the case where we are quantizing the vectors, which I think will be the default at this point for most applications, the performance is on par with fp32, maybe a little better, and we can reduce storage requirements and also enable ingesting fp16 vectors without casting.
msokolov left a comment
I got part way through reviewing -- there's a lot of code here! One thing I didn't like is all the if statements we had to add to Lucene104ScalarQuantizedVectorsWriter. I see why it's needed but can't help wishing there was a neater way to organize the branching on float16/float32 implementations. Maybe simply factoring out those conditionals into private utility methods for better readability? Or propagating generics further into the class hierarchy, although I'm a little afraid of where that could lead.
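(One possible shape for the "private utility method" idea; the helper name and the plain Object parameter are hypothetical, not from the PR.)

```java
// Hypothetical utility: centralize the float16/float32 branching that
// is currently repeated as if statements throughout the writer.
private static float[] asFloats(Object vector, VectorEncoding encoding) {
  return switch (encoding) {
    case FLOAT32 -> (float[]) vector;
    case FLOAT16 -> {
      short[] fp16 = (short[]) vector;
      float[] fp32 = new float[fp16.length];
      for (int i = 0; i < fp16.length; i++) {
        fp32[i] = Float.float16ToFloat(fp16[i]);
      }
      yield fp32;
    }
    default -> throw new IllegalArgumentException("unsupported encoding: " + encoding);
  };
}
```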
}

@Benchmark
public float shortDotProductScalar() {
This treats the short[] as an array of fp16, right? Maybe we should change the name of the method to `fp16DotProductScalar`?
public Float16VectorValues getFloat16VectorValues(String field) throws IOException {
  FieldInfo info = readState.fieldInfos.fieldInfo(field);
  if (info == null) {
    // mirror the handling in Lucene90VectorReader#getVectorValues
This comment seems out of date since Lucene90VectorReader no longer exists. I'd simply delete the whole comment.
I think I copied this from getFloatVectorValues. I will fix it in the next revision.
}
FieldEntry fieldEntry = fieldEntries.get(info.number);
if (fieldEntry == null) {
  // mirror the handling in Lucene90VectorReader#getVectorValues
if (fieldData.fieldInfo.getVectorEncoding() == VectorEncoding.FLOAT32) {
  corrections =
      scalarQuantizer.scalarQuantize(
          (float[]) fieldData.getVectors().get(i),
hmm, even using generics we still have a cast here -- I wonder if the introduction of generics is worth it
public void mergeOneField(FieldInfo fieldInfo, MergeState mergeState) throws IOException {
  if (!fieldInfo.getVectorEncoding().equals(VectorEncoding.FLOAT32)) {
  VectorEncoding vectorEncoding = fieldInfo.getVectorEncoding();
  if (vectorEncoding != VectorEncoding.FLOAT32 && vectorEncoding != VectorEncoding.FLOAT16) {
Maybe we could introduce a method VectorEncoding.isFloatingPoint()?
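(A sketch of that suggestion, assuming the enum gains a FLOAT16 constant as this PR proposes.)

```java
// Sketch: a predicate on the existing enum so call sites can replace
// "encoding != FLOAT32 && encoding != FLOAT16" style checks.
public enum VectorEncoding {
  BYTE, FLOAT16, FLOAT32;

  public boolean isFloatingPoint() {
    return this == FLOAT16 || this == FLOAT32;
  }
}
```

The merge check above would then read `if (fieldInfo.getVectorEncoding().isFloatingPoint() == false)`.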
}

static class FieldWriter extends FlatFieldVectorsWriter<float[]> {
private abstract static class FieldWriter<T> extends FlatFieldVectorsWriter<T> {
ah I see, we already had generics so we're almost forced to use it here now
public List<float[]> getVectors() {
  return flatFieldVectorsWriter.getVectors();
@SuppressWarnings("unchecked")
static FieldWriter<?> create(
Would it help to make this co-vary with FlatFieldVectorsWriter<T> instead of <?>?
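(A hedged sketch of what that co-variance might look like; the concrete subclass names are invented for illustration, and the unchecked casts hint at why it may not buy much.)

```java
// Hypothetical covariant factory: callers keep a typed FieldWriter<T>,
// but the encoding switch still forces unchecked casts internally.
@SuppressWarnings("unchecked")
static <T> FieldWriter<T> create(FieldInfo info, FlatFieldVectorsWriter<T> delegate) {
  return switch (info.getVectorEncoding()) {
    case FLOAT32 -> (FieldWriter<T>) new FloatFieldWriter((FlatFieldVectorsWriter<float[]>) delegate);
    case FLOAT16 -> (FieldWriter<T>) new Float16FieldWriter((FlatFieldVectorsWriter<short[]>) delegate);
    default -> throw new IllegalArgumentException("unsupported encoding: " + info.getVectorEncoding());
  };
}
```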
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

Description
This draft PR explores storing float vectors using 2 bytes (half-float/FP16) instead of 4 bytes (FP32), reducing vector disk usage by approximately 50%. The approach involves storing vectors on disk in half-float format while converting them back to full-float precision for dot-product computations during search and index merge operations. However, this conversion step introduces additional overhead during vector reads, resulting in slower indexing and search performance.
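(A minimal sketch of the storage idea described above, using the JDK's built-in fp16 conversions, `Float.floatToFloat16` / `Float.float16ToFloat`, available since Java 20; the PR's actual codec plumbing is more involved.)

```java
// Write path: narrow each fp32 value to a 2-byte fp16, halving storage.
static short[] toFp16(float[] vector) {
  short[] out = new short[vector.length];
  for (int i = 0; i < vector.length; i++) {
    out[i] = Float.floatToFloat16(vector[i]);
  }
  return out;
}

// Read path: widen back to fp32 for dot-product scoring; this is the
// conversion overhead mentioned in the Description.
static float[] toFp32(short[] vector) {
  float[] out = new float[vector.length];
  for (int i = 0; i < vector.length; i++) {
    out[i] = Float.float16ToFloat(vector[i]);
  }
  return out;
}
```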
This is an early draft to gather community feedback on the viability and direction of this implementation.
TODO: Support for MemorySegmentVectorScorer with half-float vectors is yet to be implemented.
With no quantization, we are seeing around a 100% increase in latency. With 8-bit quantization we see no latency regression, but with 4-bit quantization we see about an 18% latency regression. We are seeing a 20-25% drop in indexing rate across all quantization settings.