
Add half-float (FP16) storage support for vectors #15549

Open

Pulkitg64 wants to merge 20 commits into apache:main from Pulkitg64:float16

Conversation

@Pulkitg64
Contributor

Pulkitg64 commented Jan 5, 2026

Description

This draft PR explores storing float vectors using 2 bytes (half-float/FP16) instead of 4 bytes (FP32), reducing vector disk usage by approximately 50%. The approach involves storing vectors on disk in half-float format while converting them back to full-float precision for dot-product computations during search and index merge operations. However, this conversion step introduces additional overhead during vector reads, resulting in slower indexing and search performance.

This is an early draft to gather community feedback on the viability and direction of this implementation.

TODO: Support for MemorySegmentVectorScorer with half-float vectors is not yet implemented.
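The storage scheme described above can be sketched with the JDK's built-in FP16 conversions (`Float.floatToFloat16` / `Float.float16ToFloat`, available since JDK 20). This is only a toy illustration of the round-trip, not the PR's codec code, and the class name is made up:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Toy sketch of the FP16 round-trip: encode a float[] vector to 2 bytes per
// dimension on "disk", then inflate back to FP32 for scoring.
public class Fp16RoundTrip {
  static byte[] encode(float[] vec) {
    ByteBuffer buf =
        ByteBuffer.allocate(vec.length * Short.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (float v : vec) {
      buf.putShort(Float.floatToFloat16(v)); // 2 bytes per dimension instead of 4
    }
    return buf.array();
  }

  static float[] decode(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    float[] vec = new float[bytes.length / Short.BYTES];
    for (int i = 0; i < vec.length; i++) {
      vec[i] = Float.float16ToFloat(buf.getShort()); // inflate back to FP32
    }
    return vec;
  }

  public static void main(String[] args) {
    float[] v = {0.1f, -1.5f, 3.14159f};
    byte[] packed = encode(v);
    System.out.println(packed.length); // 6 bytes vs 12 for FP32
    // -1.5 is exactly representable in FP16; 0.1 and pi lose precision
    System.out.println(decode(packed)[1]); // -1.5
  }
}
```

Note the decode step is exactly the extra per-read cost the description mentions: every stored vector must be widened back to FP32 before dot-product computation.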

  • Benchmark Results:

For no quantization, we see roughly a 100% increase in latency. For 8-bit quantization there is no latency regression, but for 4-bit there is about an 18% regression. The indexing rate drops 20-25% across all quantization settings.

| Encoding | recall | latency (ms) | quantized | index (s) | index_docs/s | index_size (MB) | vec_disk (MB) | vec_RAM (MB) |
|----------|--------|--------------|-----------|-----------|--------------|-----------------|---------------|--------------|
| float16 | 0.991 | 11.392 | no | 34.8 | 2873.81 | 206.22 | 390.625 | 390.625 |
| float16 | 0.981 | 4.337 | 8 bits | 41.55 | 2406.97 | 305.4 | 294.495 | 99.182 |
| float16 | 0.926 | 6.069 | 4 bits | 42.07 | 2376.93 | 256.58 | 245.667 | 50.354 |
| float32 | 0.991 | 4.942 | no | 28.93 | 3456.38 | 401.53 | 390.625 | 390.625 |
| float32 | 0.981 | 4.367 | 8 bits | 32.04 | 3121.49 | 500.71 | 489.807 | 99.182 |
| float32 | 0.926 | 5.343 | 4 bits | 32.12 | 3113.33 | 451.91 | 440.979 | 50.354 |

@benwtrent
Member

@Pulkitg64 the latency is the main concern IMO. We must copy the vectors onto heap (we know this is expensive), transform the bytes to float32 (which is an additional cost), then do the float32 panama vector actions (which are super fast). I would expect this to also impact quantization query time for anything that must rescore (though, likely less of an impact as that would be fewer vectors to decode).

I wonder if all the cost is spent just decoding the vector? What does a flame graph tell you?

Also, could you indicate your JVM, etc.?

See this interesting jep update on the ever incubating vector API:

https://openjdk.org/jeps/508

> Addition, subtraction, division, multiplication, square root, and fused multiply/add operations on Float16 values are now auto-vectorized on supporting x64 CPUs.

@benwtrent
Member

@Pulkitg64 also, thank you for doing an initial pass and benchmarking; it's important data :D.

I wonder if we want a true element type vs. a new format?

The element type has indeed expanded its various uses, but for many of them, Float16 isn't that much different than float (e.g. you still likely query & index with float[], still use FloatVectorValues, etc.). The only difference is the on disk representation (which...seems like a format thing).

This is just an idea. I am not 100% sold either way. Looking for discussion.

@rmuir
Member

rmuir commented Jan 5, 2026

You need https://bugs.openjdk.org/browse/JDK-8370691 for this one to be performant.

@rmuir
Member

rmuir commented Jan 5, 2026

Just look at numbers on the PR. they benchmark the cosine and the dot product. Maybe try it out with the branch from that openjdk PR.

Code in o.a.l.internal.vectorization will be needed that takes advantage of the new Float16Vector or whatever the name ends out being. I would try to keep it looking as close to the existing 32-bit float code as possible.

@Pulkitg64
Contributor Author

Thanks @benwtrent, @rmuir for such quick responses.

Let me try to gather some more data to confirm if the conversion is driving the regression.

> Just look at numbers on the PR. they benchmark the cosine and the dot product. Maybe try it out with the branch from that openjdk PR.
>
> Code in o.a.l.internal.vectorization will be needed that takes advantage of the new Float16Vector or whatever the name ends out being. I would try to keep it looking as close to the existing 32-bit float code as possible.

Trying now

@Pulkitg64
Contributor Author

Pulkitg64 commented Jan 7, 2026

Here is the profiler difference between the float16 and float32 benchmark runs with no quantization. The comparison below clearly shows that the additional latency in the float16 run comes from reading the float16 vectors.

[Screenshot: profiler comparison of the float16 and float32 runs, 2026-01-07]

> Also, could you indicate your JVM, etc.?

I am running these tests on an x86 machine with JDK 25:

java --version
openjdk 25.0.1 2025-10-21 LTS
OpenJDK Runtime Environment Corretto-25.0.1.9.1 (build 25.0.1+9-LTS)
OpenJDK 64-Bit Server VM Corretto-25.0.1.9.1 (build 25.0.1+9-LTS, mixed mode, sharing)

@rmuir
Member

rmuir commented Jan 7, 2026

stop converting. use the native fp16 type (and vector type), otherwise code will be slow

@Pulkitg64
Contributor Author

Pulkitg64 commented Jan 15, 2026

I don't have any good news yet, but since it has been more than a week, here is some progress:

I tried using the JDK PR for Float16 computation as suggested by @rmuir. For this I had to check out the JDK, pull the PR branch locally, build it, and use that build to compile the Lucene code. I then added new APIs to support Float16 vectors everywhere: a new KnnFloat16VectorField, KnnFloat16VectorQuery, scorer, etc. (In hindsight I should have focused only on the vector-scorer implementation and its benchmarks first; lesson learned.)

After doing all of the above, I ran the benchmarks below.

  • With DefaultVectorUtilSupport (NumDocs: 100k)

    For DefaultVectorUtilSupport, I implemented the dot-product function below, which converts the vectors to float32 before doing any computation. With this I am seeing around a 100% latency regression for no quantization, because of the extra conversion. For the quantized cases latency is comparable, but indexing is much slower; I think this is again due to converting shorts to floats during vector quantization (which I can try to optimize).

    @Override
    public short dotProduct(short[] a, short[] b) {
      assert a.length == b.length : "Vector lengths must match";

      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        // widen each FP16 bit pattern to FP32 before the fused multiply-add
        sum = Math.fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), sum);
      }
      return Float.floatToFloat16(sum);
    }
    
    
| Encoding | recall | latency (ms) | netCPU | avgCpuCount | quantized | visited | index (s) | index_docs/s | force_merge (s) | index_size (MB) | vec_disk (MB) | vec_RAM (MB) |
|----------|--------|--------------|--------|-------------|-----------|---------|-----------|--------------|-----------------|-----------------|---------------|--------------|
| float16 | 0.989 | 9.924 | 9.868 | 0.994 | no | 5659 | 75.67 | 1321.46 | 0 | 206.19 | 390.625 | 390.625 |
| float16 | 0.981 | 4.911 | 4.884 | 0.994 | 8 bits | 5680 | 82.27 | 1215.54 | 40.13 | 305.41 | 294.495 | 99.182 |
| float16 | 0.926 | 5.885 | 5.849 | 0.994 | 4 bits | 5727 | 82.98 | 1205.08 | 0 | 256.59 | 245.667 | 50.354 |
| float32 | 0.991 | 5.123 | 5.103 | 0.996 | no | 5680 | 28.56 | 3501.16 | 44.64 | 401.53 | 390.625 | 390.625 |
| float32 | 0.981 | 4.692 | 4.675 | 0.997 | 8 bits | 5689 | 32.29 | 3097.13 | 52.05 | 500.71 | 489.807 | 99.182 |
| float32 | 0.926 | 5.822 | 5.786 | 0.994 | 4 bits | 5728 | 32.62 | 3065.79 | 64.55 | 451.9 | 440.979 | 50.354 |
  • With PanamaVectorUtilSupport (NumDocs: 10k only, because 100k was taking too long)

    With the Float16 Panama implementation, I am seeing very bad results (almost 40x higher latency). I checked the profiler results, and one JDK-internal call, VectorPayload.getPayload(), is taking most of the time. I don't yet understand why that call is so expensive.

| Encoding | recall | latency (ms) | netCPU | avgCpuCount | visited | index (s) | index_docs/s | index_size (MB) |
|----------|--------|--------------|--------|-------------|---------|-----------|--------------|-----------------|
| float16 | 0.996 | 35.076 | 137.727 | 3.927 | 16072 | 29.70 | 336.71 | 19.96 |
| float32 | 0.998 | 0.865 | 3.139 | 3.630 | 16088 | 1.95 | 5117.71 | 39.49 |

Profiler output for float16:

PERCENT       CPU SAMPLES   STACK
66.03%        217295        jdk.internal.vm.vector.VectorSupport$VectorPayload#getPayload() [Inlined code]
10.08%        33172         jdk.incubator.vector.Float16Vector#tOpTemplate() [Inlined code]
7.90%         25995         jdk.incubator.vector.Float16#valueOf() [Inlined code]
6.16%         20265         jdk.incubator.vector.Float16Vector#lanewiseTemplate() [Inlined code]
5.07%         16691         jdk.incubator.vector.Float16#lambda$fma$0() [Inlined code]
1.54%         5084          jdk.incubator.vector.Float16Vector256#vectorFactory() [Inlined code]
0.40%         1323          jdk.incubator.vector.Float16#shortBitsToFloat16() [Inlined code]
0.35%         1160          jdk.incubator.vector.Float16Vector#fromArray0Template() [Inlined code]
0.30%         988           jdk.internal.vm.vector.VectorSupport#ternaryOp() [JIT compiled code]
0.23%         757           jdk.jfr.internal.JVM#emitEvent() [Native code]
0.21%         683           jdk.internal.vm.vector.VectorSupport$VectorPayload#<init>() [Inlined code]
0.16%         523           jdk.incubator.vector.Float16Vector$$Lambda.0x000000003811dec0#apply() [Inlined code]
0.15%         506           jdk.incubator.vector.Float16Vector#bOpTemplate() [Inlined code]
0.10%         340           jdk.incubator.vector.Float16Vector256#vec() [Inlined code]
0.10%         333           org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
0.09%         302           org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
0.09%         289           jdk.internal.vm.vector.VectorSupport#maybeRebox() [Inlined code]
0.09%         288           jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw() [Inlined code]
0.07%         231           sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
0.05%         171           jdk.incubator.vector.Float16Vector#lambda$reductionOperations$1() [Inlined code]
0.03%         112           jdk.incubator.vector.Float16Vector#rOpTemplate() [Inlined code]
0.03%         98            sun.nio.fs.UnixNativeDispatcher#open0() [Native code]
0.02%         82            java.util.TimSort#binarySort() [JIT compiled code]
0.02%         79            jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
0.02%         77            jdk.internal.vm.vector.VectorSupport#binaryOp() [JIT compiled code]
0.02%         73            sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.02%         60            org.apache.lucene.codecs.lucene90.compressing.StoredFieldsInts#readInts8() [Inlined code]
0.02%         59            org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search() [JIT compiled code]
0.02%         56            org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.02%         52            org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]

Profiler output for float32 (incorrect output; the correct one is shared here):

PERCENT       CPU SAMPLES   STACK
65.82%        679346        org.apache.lucene.internal.vectorization.DefaultVectorUtilSupport#fma() [Inlined code]
27.94%        288356        org.apache.lucene.internal.vectorization.DefaultVectorUtilSupport#dotProduct() [Inlined code]
1.31%         13488         org.apache.lucene.index.Float16VectorValues$1#vectorValue() [Inlined code]
0.36%         3739          org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
0.36%         3677          org.apache.lucene.util.FixedBitSet#getAndSet() [Inlined code]
0.31%         3227          org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$MergeSearcher#graphSeek() [JIT compiled code]
0.30%         3098          org.apache.lucene.util.VectorUtil#dotProduct() [Inlined code]
0.30%         3075          org.apache.lucene.util.hnsw.OnHeapHnswGraph#nextNeighbor() [Inlined code]
0.25%         2567          org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
0.20%         2055          java.util.Arrays#fill() [Inlined code]
0.15%         1594          org.apache.lucene.util.hnsw.NeighborArray#nodes() [Inlined code]
0.14%         1489          org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
0.13%         1374          org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [JIT compiled code]
0.10%         1046          jdk.jfr.internal.JVM#emitEvent() [Native code]
0.09%         931           org.apache.lucene.codecs.lucene95.OffHeapFloat16VectorValues#vectorValue() [Inlined code]
0.09%         915           org.apache.lucene.util.hnsw.OnHeapHnswGraph#getNeighbors() [Inlined code]
0.08%         778           org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.07%         734           java.util.concurrent.locks.AbstractQueuedLongSynchronizer#apparentlyFirstQueuedIsExclusive() [Inlined code]
0.06%         615           java.util.ArrayList#elementData() [Inlined code]
0.06%         574           sun.nio.ch.UnixFileDispatcherImpl#write0() [Native code]
0.05%         564           jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw() [Inlined code]
0.05%         514           org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$Float16ScoringSupplier$1#score() [Inlined code]
0.05%         507           jdk.internal.foreign.AbstractMemorySegmentImpl#copy() [Inlined code]
0.05%         501           jdk.internal.misc.ScopedMemoryAccess#getByteInternal() [Inlined code]
0.05%         479           org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
0.04%         461           jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds() [Inlined code]
0.04%         426           org.apache.lucene.util.ArrayUtil#growExact() [Inlined code]
0.04%         414           org.apache.lucene.util.hnsw.HnswGraphSearcher#graphNextNeighbor() [Inlined code]
0.04%         412           org.apache.lucene.util.hnsw.NeighborArray#addOutOfOrder() [Inlined code]
0.04%         391           org.apache.lucene.util.packed.DirectMonotonicReader#get() [Inlined code]

Next Steps:

  • Try to understand why there is so much regression with the Float16 Panama support, and interpret the profiler results better.

@rmuir
Member

rmuir commented Jan 15, 2026

I looked at your commented-out code here and it doesn't seem to use Float16Vector class but is instead doing a bunch of conversions and scalar operations

@Pulkitg64
Contributor Author

Hi @rmuir ,

> I looked at your commented-out code here and it doesn't seem to use Float16Vector class but is instead doing a bunch of conversions and scalar operations

For the DefaultVectorUtilSupport path (not using Panama), I tried three different approaches (which is why the other two are commented out); Approach 1 gave the best performance in my benchmarks.

  • Approach 1 (best performance): convert the FP16 short values to float32, then pass them to the fma function.

JMH:
Benchmark                                  (size)   Mode  Cnt  Score   Error   Units
VectorUtilBenchmark.shortDotProductScalar    1024  thrpt   15  0.753 ± 0.001  ops/us

Code:
@Override
public short dotProduct(short[] a, short[] b) {
  assert a.length == b.length : "Vector lengths must match";

  float sum = 0f;
  for (int i = 0; i < a.length; i++) {
    sum = Math.fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), sum);
  }
  return Float.floatToFloat16(sum);
}
  • Approach 2: use Float16 objects and the Float16.fma function for computation. Internally the implementation converts the Float16 objects to float32 anyway, which I think is why it is not performant.

JMH:
Benchmark                                  (size)   Mode  Cnt  Score    Error   Units
VectorUtilBenchmark.shortDotProductScalar    1024  thrpt   15  0.077 ±  0.001  ops/us

Code:
@Override
public short dotProduct(short[] a, short[] b) {
  assert a.length == b.length : "Vector lengths must match";

  Float16 sum = Float16.valueOf(0);
  for (int i = 0; i < a.length; i++) {
    sum = Float16.fma(Float16.shortBitsToFloat16(a[i]), Float16.shortBitsToFloat16(b[i]), sum);
  }
  return sum.shortValue();
}
  • Approach 3: an extension of Approach 1 that unrolls the loop, but I am not seeing any difference in performance.

JMH:
Benchmark                                  (size)   Mode  Cnt  Score   Error   Units
VectorUtilBenchmark.shortDotProductScalar    1024  thrpt   15  0.748 ± 0.002  ops/us

Code:
@Override
  public short dotProduct(short[] a, short[] b) {
    float res = 0f;
    int i = 0;

    // if the array is big, unroll it
    if (a.length > 32) {
      float acc1 = 0f;
      float acc2 = 0f;
      float acc3 = 0f;
      float acc4 = 0f;
      int upperBound = a.length & ~(4 - 1);
      for (; i < upperBound; i += 4) {
        acc1 = fma(Float.float16ToFloat(a[i]),     Float.float16ToFloat(b[i]),     acc1);
        acc2 = fma(Float.float16ToFloat(a[i + 1]), Float.float16ToFloat(b[i + 1]), acc2);
        acc3 = fma(Float.float16ToFloat(a[i + 2]), Float.float16ToFloat(b[i + 2]), acc3);
        acc4 = fma(Float.float16ToFloat(a[i + 3]), Float.float16ToFloat(b[i + 3]), acc4);
      }
      res += acc1 + acc2 + acc3 + acc4;
    }

    for (; i < a.length; i++) {
      res = fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), res);
    }
    return Float.floatToFloat16(res);
  }
  • Note:

Float16Vector is used in the PanamaVectorUtilSupport class, for which we are seeing very bad performance, as explained in my comment above. (Sorry for the confusion; the PR size makes it difficult to navigate.) Please let me know if you meant something else in your comment.

@rmuir
Member

rmuir commented Jan 16, 2026

as i said, you aren't using the vector classes correctly

@Pulkitg64
Contributor Author

Pulkitg64 commented Jan 20, 2026

Hi,
Perhaps I'm misunderstanding something, but based on the profiler results here, I think the Panama implementation is using the Float16Vector class correctly. Could you point me to the function you are referring to?

@Pulkitg64
Contributor Author

Pulkitg64 commented Jan 20, 2026

Sorry, I earlier pasted the wrong profiler output for the float32 implementation. This is the correct profiler output for float32:

39.85%        2457          jdk.incubator.vector.FloatVector#lanewiseTemplate() [Inlined code]
5.97%         368           jdk.incubator.vector.FloatVector#reduceLanesTemplate() [Inlined code]
5.43%         335           org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
4.61%         284           org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
3.99%         246           sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
2.74%         169           jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
1.43%         88            org.apache.lucene.document.StoredField#<init>() [Inlined code]
1.41%         87            org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search() [JIT compiled code]
1.35%         83            java.util.TimSort#binarySort() [JIT compiled code]
1.12%         69            jdk.incubator.vector.FloatVector#fromArray0Template() [Inlined code]
1.10%         68            org.apache.lucene.internal.vectorization.Lucene99MemorySegmentFloatVectorScorer#bulkScore() [JIT compiled code]
0.96%         59            org.apache.lucene.codecs.lucene90.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor#decompress() [JIT compiled code]
0.91%         56            sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.86%         53            org.apache.lucene.internal.vectorization.MemorySegmentBulkVectorOps$DotProduct#dotProductBulkImpl() [Interpreted code]
0.79%         49            org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.73%         45            org.apache.lucene.util.TernaryLongHeap#updateTop() [JIT compiled code]
0.68%         42            org.apache.lucene.codecs.lucene90.compressing.StoredFieldsInts#readInts8() [Inlined code]
0.65%         40            jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw() [Inlined code]
0.63%         39            sun.nio.fs.UnixNativeDispatcher#open0() [Native code]
0.62%         38            org.apache.lucene.util.TernaryLongHeap#insertWithOverflow() [Inlined code]
0.57%         35            java.util.HashMap#resize() [JIT compiled code]
0.52%         32            org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
0.45%         28            org.apache.lucene.util.VectorUtil#normalizeToUnitInterval() [Inlined code]
0.44%         27            java.lang.Integer#valueOf() [Inlined code]
0.42%         26            org.apache.lucene.search.TaskExecutor#invokeAll() [JIT compiled code]
0.42%         26            jdk.jfr.internal.JVM#emitEvent() [Native code]
0.42%         26            java.util.zip.Inflater#inflateBytesBytes() [Native code]
0.42%         26            org.apache.lucene.util.hnsw.RandomVectorScorer$AbstractRandomVectorScorer#ordToDoc() [Inlined code]
0.42%         26            java.util.TimSort#mergeLo() [JIT compiled code]
0.41%         25            org.apache.lucene.util.hnsw.NeighborQueue#encode() [Inlined code]

and this is for float16:

66.03%        217295        jdk.internal.vm.vector.VectorSupport$VectorPayload#getPayload() [Inlined code]
10.08%        33172         jdk.incubator.vector.Float16Vector#tOpTemplate() [Inlined code]
7.90%         25995         jdk.incubator.vector.Float16#valueOf() [Inlined code]
6.16%         20265         jdk.incubator.vector.Float16Vector#lanewiseTemplate() [Inlined code]
5.07%         16691         jdk.incubator.vector.Float16#lambda$fma$0() [Inlined code]
1.54%         5084          jdk.incubator.vector.Float16Vector256#vectorFactory() [Inlined code]
0.40%         1323          jdk.incubator.vector.Float16#shortBitsToFloat16() [Inlined code]
0.35%         1160          jdk.incubator.vector.Float16Vector#fromArray0Template() [Inlined code]
0.30%         988           jdk.internal.vm.vector.VectorSupport#ternaryOp() [JIT compiled code]
0.23%         757           jdk.jfr.internal.JVM#emitEvent() [Native code]
0.21%         683           jdk.internal.vm.vector.VectorSupport$VectorPayload#<init>() [Inlined code]
0.16%         523           jdk.incubator.vector.Float16Vector$$Lambda.0x000000003811dec0#apply() [Inlined code]
0.15%         506           jdk.incubator.vector.Float16Vector#bOpTemplate() [Inlined code]
0.10%         340           jdk.incubator.vector.Float16Vector256#vec() [Inlined code]
0.10%         333           org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
0.09%         302           org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
0.09%         289           jdk.internal.vm.vector.VectorSupport#maybeRebox() [Inlined code]
0.09%         288           jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw() [Inlined code]
0.07%         231           sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
0.05%         171           jdk.incubator.vector.Float16Vector#lambda$reductionOperations$1() [Inlined code]
0.03%         112           jdk.incubator.vector.Float16Vector#rOpTemplate() [Inlined code]
0.03%         98            sun.nio.fs.UnixNativeDispatcher#open0() [Native code]
0.02%         82            java.util.TimSort#binarySort() [JIT compiled code]
0.02%         79            jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
0.02%         77            jdk.internal.vm.vector.VectorSupport#binaryOp() [JIT compiled code]
0.02%         73            sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.02%         60            org.apache.lucene.codecs.lucene90.compressing.StoredFieldsInts#readInts8() [Inlined code]
0.02%         59            org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search() [JIT compiled code]
0.02%         56            org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.02%         52            org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]

@Pulkitg64
Contributor Author

Pulkitg64 commented Feb 3, 2026

I think I found the problem: I was running these benchmarks on m5.12xlarge machines, which don't support float16 intrinsic operations. I switched to an m7g.8xlarge machine, and here are the results:

I am seeing much better performance with float16 encoding now. Float16 latency is still about 50% higher than float32, but I haven't implemented bulk scoring yet, so that may recover some latency. The indexing rate improved by 10% (perhaps because smaller vectors are fetched faster).

| Encoding | recall | latency (ms) | netCPU | avgCpuCount | visited | index (s) | index_docs/s | force_merge (s) | index_size (MB) |
|----------|--------|--------------|--------|-------------|---------|-----------|--------------|-----------------|-----------------|
| float16 | 0.992 | 3.229 | 3.154 | 0.977 | 6820 | 17.01 | 5879.93 | 0.01 | 207.65 |
| float32 | 0.990 | 2.111 | 2.066 | 0.978 | 6858 | 19.18 | 5214.04 | 22.81 | 403.03 |
  • Profiler for float16:
40.69%        82592         jdk.incubator.vector.Float16Vector#reduceLanesTemplate() [Inlined code]
20.50%        41612         org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [JIT compiled code]
5.41%         10983         jdk.incubator.vector.Float16Vector#fromArray0Template() [Inlined code]
5.00%         10158         org.apache.lucene.index.Float16VectorValues$1#vectorValue() [Inlined code]
3.92%         7964          jdk.internal.vm.vector.VectorSupport#maybeRebox() [Inlined code]
2.21%         4488          jdk.internal.vm.vector.VectorSupport$VectorPayload#getPayload() [Inlined code]
1.71%         3467          org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
1.43%         2909          org.apache.lucene.util.hnsw.OnHeapHnswGraph#getNeighbors() [Inlined code]
1.19%         2408          org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
1.18%         2386          org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
0.87%         1763          java.lang.invoke.VarHandleSegmentAsInts#get() [Inlined code]
0.85%         1722          org.apache.lucene.util.FixedBitSet#getAndSet() [Inlined code]
0.75%         1514          org.apache.lucene.util.hnsw.OnHeapHnswGraph#nextNeighbor() [Inlined code]
0.63%         1278          org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$MergeSearcher#graphSeek() [JIT compiled code]
0.62%         1251          jdk.incubator.vector.Float16Vector#lanewiseTemplate() [Inlined code]
0.61%         1247          org.apache.lucene.util.hnsw.HnswGraphBuilder#diversityCheck() [JIT compiled code]
0.47%         961           java.util.ArrayList#elementData() [Inlined code]
0.47%         951           org.apache.lucene.util.hnsw.NeighborArray#size() [Inlined code]
0.45%         904           sun.nio.ch.UnixFileDispatcherImpl#write0() [Native code]
0.44%         894           org.apache.lucene.util.FixedBitSet#getAndSet() [JIT compiled code]
0.40%         813           org.apache.lucene.util.hnsw.NeighborArray#nodes() [Inlined code]
0.36%         730           sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.35%         710           org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$Float16ScoringSupplier$1#setScoringOrdinal() [Inlined code]
0.34%         699           org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.34%         689           org.apache.lucene.util.hnsw.HnswGraphSearcher#graphNextNeighbor() [Inlined code]
0.33%         677           jdk.internal.misc.ScopedMemoryAccess#getByteInternal() [Inlined code]
0.31%         623           sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
0.30%         616           org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProduct() [Inlined code]
0.26%         521           jdk.incubator.vector.Float16Vector#rOpTemplate() [Inlined code]
0.24%         495           org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [Interpreted code]
  • Profiler for float32:
63.09%        125971        jdk.incubator.vector.FloatVector#reduceLanesTemplate() [Inlined code]
5.72%         11426         jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
3.86%         7714          org.apache.lucene.index.FloatVectorValues$1#vectorValue() [Inlined code]
2.97%         5930          org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
1.58%         3155          org.apache.lucene.util.FixedBitSet#getAndSet() [Inlined code]
1.35%         2691          org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
1.28%         2565          org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [JIT compiled code]
1.26%         2515          jdk.incubator.vector.FloatVector#fromArray0Template() [Inlined code]
1.25%         2500          org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$MergeSearcher#graphSeek() [JIT compiled code]
1.16%         2326          org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
1.11%         2212          org.apache.lucene.util.hnsw.HnswGraphBuilder#diversityCheck() [Inlined code]
1.08%         2164          jdk.incubator.vector.FloatVector#lanewiseTemplate() [Inlined code]
1.02%         2029          org.apache.lucene.util.hnsw.OnHeapHnswGraph#getNeighbors() [Inlined code]
0.69%         1381          jdk.internal.misc.ScopedMemoryAccess#getByteInternal() [Inlined code]
0.58%         1165          org.apache.lucene.util.hnsw.OnHeapHnswGraph#nextNeighbor() [Inlined code]
0.54%         1075          sun.nio.ch.UnixFileDispatcherImpl#write0() [Native code]
0.53%         1067          sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
0.49%         985           org.apache.lucene.util.VectorUtil#normalizeToUnitInterval() [Inlined code]
0.45%         902           org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.43%         858           org.apache.lucene.util.hnsw.NeighborArray#nodes() [Inlined code]
0.41%         811           org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier$1#setScoringOrdinal() [Inlined code]
0.39%         775           org.apache.lucene.util.GroupVIntUtil#readGroupVInt() [Inlined code]
0.38%         749           org.apache.lucene.util.packed.DirectReader$DirectPackedReader20#get() [JIT compiled code]
0.37%         739           sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.36%         716           org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [Interpreted code]
0.28%         556           jdk.incubator.vector.FloatVector#fromMemorySegment() [Inlined code]
0.24%         479           org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek() [Inlined code]
0.24%         475           java.util.concurrent.locks.ReentrantReadWriteLock#readLock() [Inlined code]
0.21%         427           java.util.ArrayList#elementData() [Inlined code]
0.20%         402           java.util.concurrent.locks.AbstractQueuedLongSynchronizer#compareAndSetState() [Inlined code]

Next Steps:

Understand the flame chart and try to further improve the float16 encoding benchmark runs.

@Pulkitg64
Contributor Author

Pulkitg64 commented Feb 3, 2026

After adding bulk-scoring support for float16 dot-product score calculations, I am seeing better metrics than float32:

Roughly a 10% improvement in both latency and indexing rate.

Results:
NOTE: nDoc = 100000 for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: quantized = no for all runs; skipping column
NOTE: num_segments = 1 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: vec_disk(MB) = 390.625 for all runs; skipping column
NOTE: vec_RAM(MB) = 390.625 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column

| Encoding | recall | latency(ms) | netCPU | avgCpuCount | visited | index(s) | index_docs/s | force_merge(s) | index_size(MB) |
|----------|--------|-------------|--------|-------------|---------|----------|--------------|----------------|----------------|
| float16  | 0.989  | 1.781       | 1.740  | 0.977       | 6826    | 17.07    | 5859.60      | 0.01           | 207.66         |
| float32  | 0.990  | 2.037       | 1.992  | 0.978       | 6865    | 18.84    | 5307.01      | 22.81          | 403.01         |
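The scoring path described above stores FP16 bit patterns and decodes them back to FP32 before the dot product. As a minimal sketch of that conversion path, here is scalar FP16 dot-product scoring using the JDK's `Float.floatToFloat16`/`Float.float16ToFloat` (available since Java 20); the class and method names are hypothetical, not Lucene's:

```java
// Minimal sketch: FP16-encoded vectors scored by decoding each lane to FP32.
public class Fp16DotProduct {

  // Encode a float[] into FP16 bit patterns, as the writer would on flush.
  static short[] toFloat16(float[] v) {
    short[] out = new short[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = Float.floatToFloat16(v[i]); // JDK 20+ API
    }
    return out;
  }

  // Decode each lane back to FP32 and accumulate; this per-lane conversion
  // is the extra search-time cost the benchmarks above are measuring.
  static float dotProduct(short[] a, short[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += Float.float16ToFloat(a[i]) * Float.float16ToFloat(b[i]);
    }
    return sum;
  }

  public static void main(String[] args) {
    short[] a = toFloat16(new float[] {1f, 2f, 3f});
    short[] b = toFloat16(new float[] {4f, 5f, 6f});
    // Small integers are exactly representable in FP16, so this is exact.
    System.out.println(dotProduct(a, b)); // prints 32.0
  }
}
```

A bulk-scoring variant would amortize the decode across many target vectors per call instead of one vector pair at a time.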

@Pulkitg64
Contributor Author

Pulkitg64 commented Feb 4, 2026

These are the results with different quantizations:

With quantization enabled, we are seeing similar performance to float32, which is a bit surprising to me because I expected float16 to be slower: in my code, I have to inflate each fp16 vector to fp32 while quantizing it. I will check and confirm that there is no mistake in the benchmark runs.

| Encoding | recall | latency(ms) | netCPU | avgCpuCount | quantized | visited | index(s) | index_docs/s | force_merge(s) | index_size(MB) | vec_disk(MB) | vec_RAM(MB) |
|----------|--------|-------------|--------|-------------|-----------|---------|----------|--------------|----------------|----------------|--------------|-------------|
| float16  | 0.990  | 1.758       | 1.720  | 0.978       | no        | 6822    | 17.03    | 5873.37      | 0.01           | 207.68         | 390.625      | 390.625     |
| float16  | 0.982  | 2.954       | 2.889  | 0.978       | 8 bits    | 6851    | 19.5     | 5129.26      | 20.74          | 306.87         | 294.495      | 99.182      |
| float16  | 0.927  | 2.07        | 2.025  | 0.978       | 4 bits    | 6934    | 19.75    | 5064.06      | 0.01           | 258.09         | 245.667      | 50.354      |
| float16  | 0.717  | 1.119       | 1.094  | 0.978       | 1 bit     | 8165    | 18.76    | 5331.34      | 0.00           | 222.94         | 208.855      | 13.542      |
| float32  | 0.990  | 2.208       | 2.158  | 0.977       | no        | 6874    | 18.72    | 5341.31      | 22.91          | 403.03         | 390.625      | 390.625     |
| float32  | 0.982  | 2.925       | 2.863  | 0.979       | 8 bits    | 6866    | 19.59    | 5105.95      | 29.53          | 502.2          | 489.807      | 99.182      |
| float32  | 0.927  | 2.112       | 2.065  | 0.978       | 4 bits    | 6968    | 19.17    | 5217.57      | 22.13          | 453.41         | 440.979      | 50.354      |
| float32  | 0.717  | 1.164       | 1.139  | 0.979       | 1 bit     | 8191    | 18.68    | 5353.03      | 18.08          | 418.29         | 404.167      | 13.542      |
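To illustrate the inflation step mentioned above, here is a minimal sketch assuming, as in the comment, that the quantizer only accepts `float[]`. The class, the helper names, and the toy min/max quantization scheme are hypothetical; Lucene's real `ScalarQuantizer` also computes correction terms, omitted here.

```java
// Minimal sketch: inflate stored FP16 vectors to FP32 before 8-bit quantization.
public class InflateThenQuantize {

  // Decode a stored FP16 vector back to FP32 (the extra merge/index-time cost).
  static float[] inflate(short[] fp16) {
    float[] out = new float[fp16.length];
    for (int i = 0; i < fp16.length; i++) {
      out[i] = Float.float16ToFloat(fp16[i]);
    }
    return out;
  }

  // Toy 8-bit scalar quantization over a known [min, max] range.
  static byte[] quantize8(float[] v, float min, float max) {
    byte[] out = new byte[v.length];
    float scale = 255f / (max - min);
    for (int i = 0; i < v.length; i++) {
      int q = Math.round((v[i] - min) * scale);
      out[i] = (byte) Math.min(255, Math.max(0, q)); // clamp to the byte range
    }
    return out;
  }

  public static void main(String[] args) {
    short[] stored = new short[] {Float.floatToFloat16(0f), Float.floatToFloat16(1f)};
    byte[] q = quantize8(inflate(stored), 0f, 1f);
    System.out.println((q[0] & 0xFF) + " " + (q[1] & 0xFF)); // prints 0 255
  }
}
```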

@Pulkitg64
Contributor Author

Since we are seeing good performance with FP16 now, I wanted to know what the path forward should be. The JDK PR for adding Float16Vector operations has not been merged yet, and even if it gets merged, we likely cannot use it until the JDK 27 release (unless we have early access).

So, until we have that support, should we add a native implementation of FP16 scoring? This would be easier after #15508 (which adds native support in Lucene) gets merged. Once JDK 27 is released, we can switch to the Java implementation for scoring.

@github-actions github-actions Bot added this to the 10.5.0 milestone Feb 10, 2026
@Pulkitg64 Pulkitg64 marked this pull request as ready for review February 10, 2026 21:19
@Pulkitg64
Contributor Author

I have removed the Panama implementation for now, but we can add it back later once we have access to Float16Vector operations in the JDK. Below are the benchmark numbers with the default implementation:

Summary: used 100k docs across all runs, with force-merge. For the no-quantization case we see a high latency regression of more than 100% (which is expected), but with quantization latency is comparable. On the indexing side, we see a regression in indexing time across all runs, whether or not quantization is enabled. This is also expected, because for quantization runs we have to do an extra fp16-to-fp32 conversion when quantizing vectors.

| Encoding | recall | latency(ms) | netCPU | avgCpuCount | quantized | visited | index(s) | index_docs/s | force_merge(s) | index_size(MB) | vec_disk(MB) | vec_RAM(MB) |
|----------|--------|-------------|--------|-------------|-----------|---------|----------|--------------|----------------|----------------|--------------|-------------|
| float16  | 0.990  | 5.739       | 5.738  | 1           | no        | 6848    | 42.51    | 2352.66      | 0.00           | 207.68         | 390.625      | 390.625     |
| float16  | 0.982  | 2.681       | 2.680  | 1           | 8 bits    | 6858    | 44.3     | 2257.34      | 20.51          | 306.88         | 294.495      | 99.182      |
| float16  | 0.927  | 1.919       | 1.917  | 0.999       | 4 bits    | 6934    | 44.59    | 2242.55      | 0.01           | 258.09         | 245.667      | 50.354      |
| float16  | 0.835  | 1.525       | 1.524  | 0.999       | 2 bits    | 7277    | 44.68    | 2237.99      | 0.01           | 234.01         | 221.062      | 25.749      |
| float16  | 0.717  | 1.146       | 1.145  | 0.999       | 1 bit     | 8167    | 43.98    | 2273.92      | 0.01           | 222.96         | 208.855      | 13.542      |
| float32  | 0.990  | 2.258       | 2.257  | 0.999       | no        | 6863    | 19.84    | 5039.31      | 20.62          | 403.02         | 390.625      | 390.625     |
| float32  | 0.982  | 2.756       | 2.754  | 1           | 8 bits    | 6867    | 21.29    | 4697.48      | 27.93          | 502.19         | 489.807      | 99.182      |
| float32  | 0.927  | 1.91        | 1.909  | 0.999       | 4 bits    | 6962    | 20.03    | 4992.01      | 22.23          | 453.4          | 440.979      | 50.354      |
| float32  | 0.835  | 1.462       | 1.461  | 0.999       | 2 bits    | 7302    | 20.45    | 4890.93      | 20.92          | 429.31         | 416.374      | 25.749      |
| float32  | 0.717  | 1.174       | 1.173  | 0.999       | 1 bit     | 8205    | 20.16    | 4959.33      | 17.57          | 418.29         | 404.167      | 13.542      |

Next Steps:

If we are okay with the above performance numbers, should we go ahead with this PR, which adds float16 VectorEncoding support without a Panama implementation, or should we park it and wait for the JDK 27 release?
CC: @rmuir @benwtrent

@mikemccand
Member

Thanks @Pulkitg64, this is a very exciting change. It's frustrating to receive fp16 vectors (on Amazon's customer-facing product search team) for indexing and have to fluff them up to fp32 entirely, before then quantizing them down to a more sane 1, 2, 4, or 8 bits per dim. And because these fluffy vectors take 2X the storage they really should, we build ways to drop them from read-only replica indices.

It would be so much better if Lucene could handle incoming vectors entirely as their original fp16 form (this PR).

So, it's JDK 27 that will introduce Panama access to fp16 SIMD capabilities? And modern CPUs generally have good support for fp16? And today (pre-JDK 27) this PR must emulate the fp16 operations in simple Java code, which is why it's slower?

If we enabled users to swap in their own PanamaVectorUtilSupport (#15508 -- whoa, merged!), users could in theory build a gcc-compiled, auto-vectorized implementation, make it accessible through JNA/JNI, and get good performance before JDK 27?

I haven't looked closely at the code changes yet ... just trying to get a grip on the high level situation. Thanks @Pulkitg64.

@msokolov
Contributor

It seems as if this PR has merit as-is, without any special SIMD support from either the JDK or a custom vector support provider implementation: at least when quantizing vectors, which I think will be the default for most applications at this point, performance is on par with fp32 (maybe a little better), we reduce storage requirements, and we also enable ingesting fp16 vectors without casting.

Contributor

@msokolov msokolov left a comment


I got part way through reviewing -- there's a lot of code here! One thing I didn't like is all the if statements we had to add to Lucene104ScalarQuantizedVectorsWriter. I see why it's needed but can't help wishing there was a neater way to organize the branching on float16/float32 implementations. Maybe simply factoring out those conditionals into private utility methods for better readability? Or propagating generics further into the class hierarchy, although I'm a little afraid of where that could lead.

}

@Benchmark
public float shortDotProductScalar() {
Contributor


This treats the short[] as an array of fp16 values, right? Maybe we should rename the method to `fp16DotProductScalar`?

Contributor Author


Makes sense!

public Float16VectorValues getFloat16VectorValues(String field) throws IOException {
FieldInfo info = readState.fieldInfos.fieldInfo(field);
if (info == null) {
// mirror the handling in Lucene90VectorReader#getVectorValues
Contributor


This comment seems out of date since Lucene90VectorReader no longer exists. I'd simply delete the whole comment

Contributor Author


I think I copied this from getFloatVectorValues. I will fix it in the next revision.

}
FieldEntry fieldEntry = fieldEntries.get(info.number);
if (fieldEntry == null) {
// mirror the handling in Lucene90VectorReader#getVectorValues
Contributor


same here

if (fieldData.fieldInfo.getVectorEncoding() == VectorEncoding.FLOAT32) {
corrections =
scalarQuantizer.scalarQuantize(
(float[]) fieldData.getVectors().get(i),
Contributor


hmm, even using generics we still have a cast here -- I wonder if the introduction of generics is worth it

public void mergeOneField(FieldInfo fieldInfo, MergeState mergeState) throws IOException {
if (!fieldInfo.getVectorEncoding().equals(VectorEncoding.FLOAT32)) {
VectorEncoding vectorEncoding = fieldInfo.getVectorEncoding();
if (vectorEncoding != VectorEncoding.FLOAT32 && vectorEncoding != VectorEncoding.FLOAT16) {
Contributor


Maybe we could introduce a method VectorEncoding.isFloatingPoint()?
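A minimal sketch of the suggested helper, on a hypothetical enum mirroring Lucene's `VectorEncoding` (not the real class):

```java
// Hypothetical enum sketching the reviewer's suggested isFloatingPoint() helper.
enum VectorEncoding {
  BYTE,
  FLOAT16,
  FLOAT32;

  // Collapses the repeated "FLOAT32 || FLOAT16" checks into one query,
  // so call sites don't have to enumerate every floating-point encoding.
  boolean isFloatingPoint() {
    return this == FLOAT32 || this == FLOAT16;
  }
}
```

The caller above would then become `if (!vectorEncoding.isFloatingPoint()) { ... }`, and any future floating-point encoding only needs the enum updated in one place.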

}

static class FieldWriter extends FlatFieldVectorsWriter<float[]> {
private abstract static class FieldWriter<T> extends FlatFieldVectorsWriter<T> {
Contributor


ah I see, we already had generics so we're almost forced to use it here now

public List<float[]> getVectors() {
return flatFieldVectorsWriter.getVectors();
@SuppressWarnings("unchecked")
static FieldWriter<?> create(
Contributor


would it help to make this co-vary with FlatFieldVectorsWriter<T> instead of using <?>?

@github-actions
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions Bot added the Stale label Mar 13, 2026