
Add half-float (FP16) storage support for vectors #15549

Open

Pulkitg64 wants to merge 20 commits into apache:main from Pulkitg64:float16

Conversation

@Pulkitg64
Contributor

Pulkitg64 commented Jan 5, 2026

Description

This draft PR explores storing float vectors using 2 bytes (half-float/FP16) instead of 4 bytes (FP32), reducing vector disk usage by approximately 50%. The approach involves storing vectors on disk in half-float format while converting them back to full-float precision for dot-product computations during search and index merge operations. However, this conversion step introduces additional overhead during vector reads, resulting in slower indexing and search performance.

This is an early draft to gather community feedback on the viability and direction of this implementation.

TODO: Support for MemorySegmentVectorScorer with half-float vectors is not yet implemented.
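The storage scheme described above can be sketched with the JDK's built-in FP16 conversions (`Float.floatToFloat16` / `Float.float16ToFloat`, available since JDK 20). This is only a toy illustration of the round-trip, not the PR's codec code, and the class name is made up:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Toy sketch of the FP16 round-trip: encode a float[] vector to 2 bytes per
// dimension on "disk", then inflate back to FP32 for scoring.
public class Fp16RoundTrip {
  static byte[] encode(float[] vec) {
    ByteBuffer buf =
        ByteBuffer.allocate(vec.length * Short.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (float v : vec) {
      buf.putShort(Float.floatToFloat16(v)); // 2 bytes per dimension instead of 4
    }
    return buf.array();
  }

  static float[] decode(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    float[] vec = new float[bytes.length / Short.BYTES];
    for (int i = 0; i < vec.length; i++) {
      vec[i] = Float.float16ToFloat(buf.getShort()); // inflate back to FP32
    }
    return vec;
  }

  public static void main(String[] args) {
    float[] v = {0.1f, -1.5f, 3.14159f};
    byte[] packed = encode(v);
    System.out.println(packed.length); // 6 bytes vs 12 for FP32
    // -1.5 is exactly representable in FP16; 0.1 and pi lose precision
    System.out.println(decode(packed)[1]); // -1.5
  }
}
```

Note the decode step is exactly the extra per-read cost the description mentions: every stored vector must be widened back to FP32 before dot-product computation.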

  • Benchmark Results:

For no quantization, we see roughly a 100% increase in latency. For 8-bit quantization there is no latency regression, but for 4-bit there is about an 18% regression. The indexing rate drops 20-25% across all quantization settings.

| Encoding | recall | latency (ms) | quantized | index (s) | index_docs/s | index_size (MB) | vec_disk (MB) | vec_RAM (MB) |
|----------|--------|--------------|-----------|-----------|--------------|-----------------|---------------|--------------|
| float16 | 0.991 | 11.392 | no | 34.8 | 2873.81 | 206.22 | 390.625 | 390.625 |
| float16 | 0.981 | 4.337 | 8 bits | 41.55 | 2406.97 | 305.4 | 294.495 | 99.182 |
| float16 | 0.926 | 6.069 | 4 bits | 42.07 | 2376.93 | 256.58 | 245.667 | 50.354 |
| float32 | 0.991 | 4.942 | no | 28.93 | 3456.38 | 401.53 | 390.625 | 390.625 |
| float32 | 0.981 | 4.367 | 8 bits | 32.04 | 3121.49 | 500.71 | 489.807 | 99.182 |
| float32 | 0.926 | 5.343 | 4 bits | 32.12 | 3113.33 | 451.91 | 440.979 | 50.354 |

@benwtrent
Member

@Pulkitg64 the latency is the main concern IMO. We must copy the vectors onto heap (we know this is expensive), transform the bytes to float32 (which is an additional cost), then do the float32 panama vector actions (which are super fast). I would expect this to also impact quantization query time for anything that must rescore (though, likely less of an impact as that would be fewer vectors to decode).

I wonder if all the cost is spent just decoding the vector? What does a flame graph tell you?

Also, could you indicate your JVM, etc.?

See this interesting jep update on the ever incubating vector API:

https://openjdk.org/jeps/508

> Addition, subtraction, division, multiplication, square root, and fused multiply/add operations on Float16 values are now auto-vectorized on supporting x64 CPUs.

@benwtrent
Member

@Pulkitg64 also, thank you for doing an initial pass and benchmarking; it's important data :D.

I wonder if we want a true element type vs. a new format?

The element type has indeed expanded its various uses, but for many of them, Float16 isn't that much different than float (e.g. you still likely query & index with float[], still use FloatVectorValues, etc.). The only difference is the on disk representation (which...seems like a format thing).

This is just an idea. I am not 100% sold either way. Looking for discussion.

@rmuir
Member

rmuir commented Jan 5, 2026

You need https://bugs.openjdk.org/browse/JDK-8370691 for this one to be performant.

@rmuir
Member

rmuir commented Jan 5, 2026

Just look at numbers on the PR. they benchmark the cosine and the dot product. Maybe try it out with the branch from that openjdk PR.

Code in o.a.l.internal.vectorization will be needed that takes advantage of the new Float16Vector or whatever the name ends out being. I would try to keep it looking as close to the existing 32-bit float code as possible.

@Pulkitg64
Contributor Author

Thanks @benwtrent, @rmuir for such quick responses.

Let me try to gather some more data to confirm if the conversion is driving the regression.

> Just look at numbers on the PR. they benchmark the cosine and the dot product. Maybe try it out with the branch from that openjdk PR.
>
> Code in o.a.l.internal.vectorization will be needed that takes advantage of the new Float16Vector or whatever the name ends out being. I would try to keep it looking as close to the existing 32-bit float code as possible.

Trying now

@Pulkitg64
Contributor Author

Pulkitg64 commented Jan 7, 2026

Here is the profiler difference between the float16 and float32 benchmark runs with no quantization. The comparison below clearly shows that the additional latency in the float16 run comes from reading the float16 vectors.

[Screenshot: profiler comparison of the float16 and float32 runs, 2026-01-07]

> Also, could you indicate your JVM, etc.?

I am running these tests on an x86 machine with JDK 25:

java --version
openjdk 25.0.1 2025-10-21 LTS
OpenJDK Runtime Environment Corretto-25.0.1.9.1 (build 25.0.1+9-LTS)
OpenJDK 64-Bit Server VM Corretto-25.0.1.9.1 (build 25.0.1+9-LTS, mixed mode, sharing)

@rmuir
Member

rmuir commented Jan 7, 2026

stop converting. use the native fp16 type (and vector type), otherwise code will be slow

@Pulkitg64
Contributor Author

Pulkitg64 commented Jan 15, 2026

I don't have any good news yet, but since it has been more than a week, here is some progress:

I tried using the JDK PR for Float16 computation as suggested by @rmuir. For this I had to check out the JDK, pull the PR branch locally, build it, and use that build to compile the Lucene code. I then added new APIs to support Float16 vectors everywhere: a new KnnFloat16VectorField, KnnFloat16VectorQuery, scorer, etc. (In hindsight I should have focused only on the vector-scorer implementation and its benchmarks first; lesson learned.)

After doing all of the above, I ran the benchmarks below.

  • With DefaultVectorUtilSupport (NumDocs: 100k)

    For DefaultVectorUtilSupport, I implemented the dot-product function below, which converts the vectors to float32 before doing any computation. With this I am seeing around a 100% latency regression for no quantization, because of the extra conversion. For the quantized cases latency is comparable, but indexing is much slower; I think this is again due to converting shorts to floats during vector quantization (which I can try to optimize).

    @Override
    public short dotProduct(short[] a, short[] b) {
      assert a.length == b.length : "Vector lengths must match";

      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        // widen each FP16 bit pattern to FP32 before the fused multiply-add
        sum = Math.fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), sum);
      }
      return Float.floatToFloat16(sum);
    }
    
    
| Encoding | recall | latency (ms) | netCPU | avgCpuCount | quantized | visited | index (s) | index_docs/s | force_merge (s) | index_size (MB) | vec_disk (MB) | vec_RAM (MB) |
|----------|--------|--------------|--------|-------------|-----------|---------|-----------|--------------|-----------------|-----------------|---------------|--------------|
| float16 | 0.989 | 9.924 | 9.868 | 0.994 | no | 5659 | 75.67 | 1321.46 | 0 | 206.19 | 390.625 | 390.625 |
| float16 | 0.981 | 4.911 | 4.884 | 0.994 | 8 bits | 5680 | 82.27 | 1215.54 | 40.13 | 305.41 | 294.495 | 99.182 |
| float16 | 0.926 | 5.885 | 5.849 | 0.994 | 4 bits | 5727 | 82.98 | 1205.08 | 0 | 256.59 | 245.667 | 50.354 |
| float32 | 0.991 | 5.123 | 5.103 | 0.996 | no | 5680 | 28.56 | 3501.16 | 44.64 | 401.53 | 390.625 | 390.625 |
| float32 | 0.981 | 4.692 | 4.675 | 0.997 | 8 bits | 5689 | 32.29 | 3097.13 | 52.05 | 500.71 | 489.807 | 99.182 |
| float32 | 0.926 | 5.822 | 5.786 | 0.994 | 4 bits | 5728 | 32.62 | 3065.79 | 64.55 | 451.9 | 440.979 | 50.354 |
  • With PanamaVectorUtilSupport (NumDocs: 10k only, because 100k was taking too long)

    With the Float16 Panama implementation, I am seeing very bad results (almost 40x higher latency). I checked the profiler results, and one JDK-internal call, VectorPayload.getPayload(), is taking most of the time. I don't yet understand why that call is so expensive.

| Encoding | recall | latency (ms) | netCPU | avgCpuCount | visited | index (s) | index_docs/s | index_size (MB) |
|----------|--------|--------------|--------|-------------|---------|-----------|--------------|-----------------|
| float16 | 0.996 | 35.076 | 137.727 | 3.927 | 16072 | 29.70 | 336.71 | 19.96 |
| float32 | 0.998 | 0.865 | 3.139 | 3.630 | 16088 | 1.95 | 5117.71 | 39.49 |

Profiler output for float16:

PERCENT       CPU SAMPLES   STACK
66.03%        217295        jdk.internal.vm.vector.VectorSupport$VectorPayload#getPayload() [Inlined code]
10.08%        33172         jdk.incubator.vector.Float16Vector#tOpTemplate() [Inlined code]
7.90%         25995         jdk.incubator.vector.Float16#valueOf() [Inlined code]
6.16%         20265         jdk.incubator.vector.Float16Vector#lanewiseTemplate() [Inlined code]
5.07%         16691         jdk.incubator.vector.Float16#lambda$fma$0() [Inlined code]
1.54%         5084          jdk.incubator.vector.Float16Vector256#vectorFactory() [Inlined code]
0.40%         1323          jdk.incubator.vector.Float16#shortBitsToFloat16() [Inlined code]
0.35%         1160          jdk.incubator.vector.Float16Vector#fromArray0Template() [Inlined code]
0.30%         988           jdk.internal.vm.vector.VectorSupport#ternaryOp() [JIT compiled code]
0.23%         757           jdk.jfr.internal.JVM#emitEvent() [Native code]
0.21%         683           jdk.internal.vm.vector.VectorSupport$VectorPayload#<init>() [Inlined code]
0.16%         523           jdk.incubator.vector.Float16Vector$$Lambda.0x000000003811dec0#apply() [Inlined code]
0.15%         506           jdk.incubator.vector.Float16Vector#bOpTemplate() [Inlined code]
0.10%         340           jdk.incubator.vector.Float16Vector256#vec() [Inlined code]
0.10%         333           org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
0.09%         302           org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
0.09%         289           jdk.internal.vm.vector.VectorSupport#maybeRebox() [Inlined code]
0.09%         288           jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw() [Inlined code]
0.07%         231           sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
0.05%         171           jdk.incubator.vector.Float16Vector#lambda$reductionOperations$1() [Inlined code]
0.03%         112           jdk.incubator.vector.Float16Vector#rOpTemplate() [Inlined code]
0.03%         98            sun.nio.fs.UnixNativeDispatcher#open0() [Native code]
0.02%         82            java.util.TimSort#binarySort() [JIT compiled code]
0.02%         79            jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
0.02%         77            jdk.internal.vm.vector.VectorSupport#binaryOp() [JIT compiled code]
0.02%         73            sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.02%         60            org.apache.lucene.codecs.lucene90.compressing.StoredFieldsInts#readInts8() [Inlined code]
0.02%         59            org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search() [JIT compiled code]
0.02%         56            org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.02%         52            org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]

Profiler output for float32 (incorrect output; the correct one is shared here):

PERCENT       CPU SAMPLES   STACK
65.82%        679346        org.apache.lucene.internal.vectorization.DefaultVectorUtilSupport#fma() [Inlined code]
27.94%        288356        org.apache.lucene.internal.vectorization.DefaultVectorUtilSupport#dotProduct() [Inlined code]
1.31%         13488         org.apache.lucene.index.Float16VectorValues$1#vectorValue() [Inlined code]
0.36%         3739          org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
0.36%         3677          org.apache.lucene.util.FixedBitSet#getAndSet() [Inlined code]
0.31%         3227          org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$MergeSearcher#graphSeek() [JIT compiled code]
0.30%         3098          org.apache.lucene.util.VectorUtil#dotProduct() [Inlined code]
0.30%         3075          org.apache.lucene.util.hnsw.OnHeapHnswGraph#nextNeighbor() [Inlined code]
0.25%         2567          org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
0.20%         2055          java.util.Arrays#fill() [Inlined code]
0.15%         1594          org.apache.lucene.util.hnsw.NeighborArray#nodes() [Inlined code]
0.14%         1489          org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
0.13%         1374          org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [JIT compiled code]
0.10%         1046          jdk.jfr.internal.JVM#emitEvent() [Native code]
0.09%         931           org.apache.lucene.codecs.lucene95.OffHeapFloat16VectorValues#vectorValue() [Inlined code]
0.09%         915           org.apache.lucene.util.hnsw.OnHeapHnswGraph#getNeighbors() [Inlined code]
0.08%         778           org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.07%         734           java.util.concurrent.locks.AbstractQueuedLongSynchronizer#apparentlyFirstQueuedIsExclusive() [Inlined code]
0.06%         615           java.util.ArrayList#elementData() [Inlined code]
0.06%         574           sun.nio.ch.UnixFileDispatcherImpl#write0() [Native code]
0.05%         564           jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw() [Inlined code]
0.05%         514           org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$Float16ScoringSupplier$1#score() [Inlined code]
0.05%         507           jdk.internal.foreign.AbstractMemorySegmentImpl#copy() [Inlined code]
0.05%         501           jdk.internal.misc.ScopedMemoryAccess#getByteInternal() [Inlined code]
0.05%         479           org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
0.04%         461           jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds() [Inlined code]
0.04%         426           org.apache.lucene.util.ArrayUtil#growExact() [Inlined code]
0.04%         414           org.apache.lucene.util.hnsw.HnswGraphSearcher#graphNextNeighbor() [Inlined code]
0.04%         412           org.apache.lucene.util.hnsw.NeighborArray#addOutOfOrder() [Inlined code]
0.04%         391           org.apache.lucene.util.packed.DirectMonotonicReader#get() [Inlined code]

Next Steps:

  • Try to understand why there is so much regression with the Float16 Panama support, and interpret the profiler results better.

@rmuir
Member

rmuir commented Jan 15, 2026

I looked at your commented-out code here and it doesn't seem to use Float16Vector class but is instead doing a bunch of conversions and scalar operations

@Pulkitg64
Contributor Author

Hi @rmuir ,

> I looked at your commented-out code here and it doesn't seem to use Float16Vector class but is instead doing a bunch of conversions and scalar operations

For the DefaultVectorUtilSupport path (not using Panama), I tried three different approaches (which is why the other two are commented out); Approach 1 gave the best performance in my benchmarks.

  • Approach 1 (best performance): convert the FP16 short values to float32, then pass them to the fma function.

JMH:
Benchmark                                  (size)   Mode  Cnt  Score   Error   Units
VectorUtilBenchmark.shortDotProductScalar    1024  thrpt   15  0.753 ± 0.001  ops/us

Code:
@Override
public short dotProduct(short[] a, short[] b) {
  assert a.length == b.length : "Vector lengths must match";

  float sum = 0f;
  for (int i = 0; i < a.length; i++) {
    sum = Math.fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), sum);
  }
  return Float.floatToFloat16(sum);
}
  • Approach 2: use Float16 objects and the Float16.fma function for computation. Internally the implementation converts the Float16 objects to float32 anyway, which I think is why it is not performant.

JMH:
Benchmark                                  (size)   Mode  Cnt  Score    Error   Units
VectorUtilBenchmark.shortDotProductScalar    1024  thrpt   15  0.077 ±  0.001  ops/us

Code:
@Override
public short dotProduct(short[] a, short[] b) {
  assert a.length == b.length : "Vector lengths must match";

  Float16 sum = Float16.valueOf(0);
  for (int i = 0; i < a.length; i++) {
    sum = Float16.fma(Float16.shortBitsToFloat16(a[i]), Float16.shortBitsToFloat16(b[i]), sum);
  }
  return sum.shortValue();
}
  • Approach 3: an extension of Approach 1 that unrolls the loop, but I am not seeing any difference in performance.

JMH:
Benchmark                                  (size)   Mode  Cnt  Score   Error   Units
VectorUtilBenchmark.shortDotProductScalar    1024  thrpt   15  0.748 ± 0.002  ops/us

Code:
@Override
  public short dotProduct(short[] a, short[] b) {
    float res = 0f;
    int i = 0;

    // if the array is big, unroll it
    if (a.length > 32) {
      float acc1 = 0f;
      float acc2 = 0f;
      float acc3 = 0f;
      float acc4 = 0f;
      int upperBound = a.length & ~(4 - 1);
      for (; i < upperBound; i += 4) {
        acc1 = fma(Float.float16ToFloat(a[i]),     Float.float16ToFloat(b[i]),     acc1);
        acc2 = fma(Float.float16ToFloat(a[i + 1]), Float.float16ToFloat(b[i + 1]), acc2);
        acc3 = fma(Float.float16ToFloat(a[i + 2]), Float.float16ToFloat(b[i + 2]), acc3);
        acc4 = fma(Float.float16ToFloat(a[i + 3]), Float.float16ToFloat(b[i + 3]), acc4);
      }
      res += acc1 + acc2 + acc3 + acc4;
    }

    for (; i < a.length; i++) {
      res = fma(Float.float16ToFloat(a[i]), Float.float16ToFloat(b[i]), res);
    }
    return Float.floatToFloat16(res);
  }
  • Note:

Float16Vector is used in the PanamaVectorUtilSupport class, for which we are seeing very bad performance, as explained in my comment above. (Sorry for the confusion; the PR size makes it difficult to navigate.) Please let me know if you meant something else in your comment.

@rmuir
Member

rmuir commented Jan 16, 2026

as i said, you aren't using the vector classes correctly

@Pulkitg64
Contributor Author

Pulkitg64 commented Jan 20, 2026

Hi,
Perhaps I'm misunderstanding something, but based on the profiler results here, I think the Panama implementation is using the Float16Vector class correctly. Could you point me to the function you are referring to?

@Pulkitg64
Contributor Author

Pulkitg64 commented Jan 20, 2026

Sorry, I earlier pasted the wrong profiler output for the float32 implementation. This is the correct profiler output for float32:

39.85%        2457          jdk.incubator.vector.FloatVector#lanewiseTemplate() [Inlined code]
5.97%         368           jdk.incubator.vector.FloatVector#reduceLanesTemplate() [Inlined code]
5.43%         335           org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
4.61%         284           org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
3.99%         246           sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
2.74%         169           jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
1.43%         88            org.apache.lucene.document.StoredField#<init>() [Inlined code]
1.41%         87            org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search() [JIT compiled code]
1.35%         83            java.util.TimSort#binarySort() [JIT compiled code]
1.12%         69            jdk.incubator.vector.FloatVector#fromArray0Template() [Inlined code]
1.10%         68            org.apache.lucene.internal.vectorization.Lucene99MemorySegmentFloatVectorScorer#bulkScore() [JIT compiled code]
0.96%         59            org.apache.lucene.codecs.lucene90.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor#decompress() [JIT compiled code]
0.91%         56            sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.86%         53            org.apache.lucene.internal.vectorization.MemorySegmentBulkVectorOps$DotProduct#dotProductBulkImpl() [Interpreted code]
0.79%         49            org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.73%         45            org.apache.lucene.util.TernaryLongHeap#updateTop() [JIT compiled code]
0.68%         42            org.apache.lucene.codecs.lucene90.compressing.StoredFieldsInts#readInts8() [Inlined code]
0.65%         40            jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw() [Inlined code]
0.63%         39            sun.nio.fs.UnixNativeDispatcher#open0() [Native code]
0.62%         38            org.apache.lucene.util.TernaryLongHeap#insertWithOverflow() [Inlined code]
0.57%         35            java.util.HashMap#resize() [JIT compiled code]
0.52%         32            org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
0.45%         28            org.apache.lucene.util.VectorUtil#normalizeToUnitInterval() [Inlined code]
0.44%         27            java.lang.Integer#valueOf() [Inlined code]
0.42%         26            org.apache.lucene.search.TaskExecutor#invokeAll() [JIT compiled code]
0.42%         26            jdk.jfr.internal.JVM#emitEvent() [Native code]
0.42%         26            java.util.zip.Inflater#inflateBytesBytes() [Native code]
0.42%         26            org.apache.lucene.util.hnsw.RandomVectorScorer$AbstractRandomVectorScorer#ordToDoc() [Inlined code]
0.42%         26            java.util.TimSort#mergeLo() [JIT compiled code]
0.41%         25            org.apache.lucene.util.hnsw.NeighborQueue#encode() [Inlined code]

and this is for float16:

66.03%        217295        jdk.internal.vm.vector.VectorSupport$VectorPayload#getPayload() [Inlined code]
10.08%        33172         jdk.incubator.vector.Float16Vector#tOpTemplate() [Inlined code]
7.90%         25995         jdk.incubator.vector.Float16#valueOf() [Inlined code]
6.16%         20265         jdk.incubator.vector.Float16Vector#lanewiseTemplate() [Inlined code]
5.07%         16691         jdk.incubator.vector.Float16#lambda$fma$0() [Inlined code]
1.54%         5084          jdk.incubator.vector.Float16Vector256#vectorFactory() [Inlined code]
0.40%         1323          jdk.incubator.vector.Float16#shortBitsToFloat16() [Inlined code]
0.35%         1160          jdk.incubator.vector.Float16Vector#fromArray0Template() [Inlined code]
0.30%         988           jdk.internal.vm.vector.VectorSupport#ternaryOp() [JIT compiled code]
0.23%         757           jdk.jfr.internal.JVM#emitEvent() [Native code]
0.21%         683           jdk.internal.vm.vector.VectorSupport$VectorPayload#<init>() [Inlined code]
0.16%         523           jdk.incubator.vector.Float16Vector$$Lambda.0x000000003811dec0#apply() [Inlined code]
0.15%         506           jdk.incubator.vector.Float16Vector#bOpTemplate() [Inlined code]
0.10%         340           jdk.incubator.vector.Float16Vector256#vec() [Inlined code]
0.10%         333           org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
0.09%         302           org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
0.09%         289           jdk.internal.vm.vector.VectorSupport#maybeRebox() [Inlined code]
0.09%         288           jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw() [Inlined code]
0.07%         231           sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
0.05%         171           jdk.incubator.vector.Float16Vector#lambda$reductionOperations$1() [Inlined code]
0.03%         112           jdk.incubator.vector.Float16Vector#rOpTemplate() [Inlined code]
0.03%         98            sun.nio.fs.UnixNativeDispatcher#open0() [Native code]
0.02%         82            java.util.TimSort#binarySort() [JIT compiled code]
0.02%         79            jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
0.02%         77            jdk.internal.vm.vector.VectorSupport#binaryOp() [JIT compiled code]
0.02%         73            sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.02%         60            org.apache.lucene.codecs.lucene90.compressing.StoredFieldsInts#readInts8() [Inlined code]
0.02%         59            org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search() [JIT compiled code]
0.02%         56            org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.02%         52            org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]

@Pulkitg64
Contributor Author

Pulkitg64 commented Feb 3, 2026

I think I found the problem: I was running these benchmarks on m5.12xlarge machines, which don't support float16 intrinsic operations. I switched to an m7g.8xlarge machine, and here are the results:

I am seeing much better performance with float16 encoding now. Float16 latency is still about 50% higher than float32, but I haven't implemented bulk scoring yet, so that may recover some latency. The indexing rate improved by 10% (perhaps because smaller vectors are fetched faster).

| Encoding | recall | latency (ms) | netCPU | avgCpuCount | visited | index (s) | index_docs/s | force_merge (s) | index_size (MB) |
|----------|--------|--------------|--------|-------------|---------|-----------|--------------|-----------------|-----------------|
| float16 | 0.992 | 3.229 | 3.154 | 0.977 | 6820 | 17.01 | 5879.93 | 0.01 | 207.65 |
| float32 | 0.990 | 2.111 | 2.066 | 0.978 | 6858 | 19.18 | 5214.04 | 22.81 | 403.03 |
  • Profiler for float16:
40.69%        82592         jdk.incubator.vector.Float16Vector#reduceLanesTemplate() [Inlined code]
20.50%        41612         org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [JIT compiled code]
5.41%         10983         jdk.incubator.vector.Float16Vector#fromArray0Template() [Inlined code]
5.00%         10158         org.apache.lucene.index.Float16VectorValues$1#vectorValue() [Inlined code]
3.92%         7964          jdk.internal.vm.vector.VectorSupport#maybeRebox() [Inlined code]
2.21%         4488          jdk.internal.vm.vector.VectorSupport$VectorPayload#getPayload() [Inlined code]
1.71%         3467          org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
1.43%         2909          org.apache.lucene.util.hnsw.OnHeapHnswGraph#getNeighbors() [Inlined code]
1.19%         2408          org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
1.18%         2386          org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
0.87%         1763          java.lang.invoke.VarHandleSegmentAsInts#get() [Inlined code]
0.85%         1722          org.apache.lucene.util.FixedBitSet#getAndSet() [Inlined code]
0.75%         1514          org.apache.lucene.util.hnsw.OnHeapHnswGraph#nextNeighbor() [Inlined code]
0.63%         1278          org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$MergeSearcher#graphSeek() [JIT compiled code]
0.62%         1251          jdk.incubator.vector.Float16Vector#lanewiseTemplate() [Inlined code]
0.61%         1247          org.apache.lucene.util.hnsw.HnswGraphBuilder#diversityCheck() [JIT compiled code]
0.47%         961           java.util.ArrayList#elementData() [Inlined code]
0.47%         951           org.apache.lucene.util.hnsw.NeighborArray#size() [Inlined code]
0.45%         904           sun.nio.ch.UnixFileDispatcherImpl#write0() [Native code]
0.44%         894           org.apache.lucene.util.FixedBitSet#getAndSet() [JIT compiled code]
0.40%         813           org.apache.lucene.util.hnsw.NeighborArray#nodes() [Inlined code]
0.36%         730           sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.35%         710           org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$Float16ScoringSupplier$1#setScoringOrdinal() [Inlined code]
0.34%         699           org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.34%         689           org.apache.lucene.util.hnsw.HnswGraphSearcher#graphNextNeighbor() [Inlined code]
0.33%         677           jdk.internal.misc.ScopedMemoryAccess#getByteInternal() [Inlined code]
0.31%         623           sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
0.30%         616           org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProduct() [Inlined code]
0.26%         521           jdk.incubator.vector.Float16Vector#rOpTemplate() [Inlined code]
0.24%         495           org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [Interpreted code]
  • Profiler for float32:
63.09%        125971        jdk.incubator.vector.FloatVector#reduceLanesTemplate() [Inlined code]
5.72%         11426         jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
3.86%         7714          org.apache.lucene.index.FloatVectorValues$1#vectorValue() [Inlined code]
2.97%         5930          org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
1.58%         3155          org.apache.lucene.util.FixedBitSet#getAndSet() [Inlined code]
1.35%         2691          org.apache.lucene.util.TernaryLongHeap#downHeap() [Inlined code]
1.28%         2565          org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [JIT compiled code]
1.26%         2515          jdk.incubator.vector.FloatVector#fromArray0Template() [Inlined code]
1.25%         2500          org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder$MergeSearcher#graphSeek() [JIT compiled code]
1.16%         2326          org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#doReset() [JIT compiled code]
1.11%         2212          org.apache.lucene.util.hnsw.HnswGraphBuilder#diversityCheck() [Inlined code]
1.08%         2164          jdk.incubator.vector.FloatVector#lanewiseTemplate() [Inlined code]
1.02%         2029          org.apache.lucene.util.hnsw.OnHeapHnswGraph#getNeighbors() [Inlined code]
0.69%         1381          jdk.internal.misc.ScopedMemoryAccess#getByteInternal() [Inlined code]
0.58%         1165          org.apache.lucene.util.hnsw.OnHeapHnswGraph#nextNeighbor() [Inlined code]
0.54%         1075          sun.nio.ch.UnixFileDispatcherImpl#write0() [Native code]
0.53%         1067          sun.nio.ch.UnixFileDispatcherImpl#force0() [Native code]
0.49%         985           org.apache.lucene.util.VectorUtil#normalizeToUnitInterval() [Inlined code]
0.45%         902           org.apache.lucene.util.TernaryLongHeap#upHeap() [Inlined code]
0.43%         858           org.apache.lucene.util.hnsw.NeighborArray#nodes() [Inlined code]
0.41%         811           org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier$1#setScoringOrdinal() [Inlined code]
0.39%         775           org.apache.lucene.util.GroupVIntUtil#readGroupVInt() [Inlined code]
0.38%         749           org.apache.lucene.util.packed.DirectReader$DirectPackedReader20#get() [JIT compiled code]
0.37%         739           sun.nio.ch.UnixFileDispatcherImpl#read0() [Native code]
0.36%         716           org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [Interpreted code]
0.28%         556           jdk.incubator.vector.FloatVector#fromMemorySegment() [Inlined code]
0.24%         479           org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek() [Inlined code]
0.24%         475           java.util.concurrent.locks.ReentrantReadWriteLock#readLock() [Inlined code]
0.21%         427           java.util.ArrayList#elementData() [Inlined code]
0.20%         402           java.util.concurrent.locks.AbstractQueuedLongSynchronizer#compareAndSetState() [Inlined code]

Next Steps:

Understand the flame chart and try to further improve the float16 encoding benchmark runs.

@Pulkitg64
Contributor Author

Pulkitg64 commented Feb 3, 2026

After adding bulk-scoring support for float16 dot-product score calculations, I am seeing better metrics than float32:

Roughly a 10% improvement in both latency and indexing rate.

Results:
NOTE: nDoc = 100000 for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: quantized = no for all runs; skipping column
NOTE: num_segments = 1 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: vec_disk(MB) = 390.625 for all runs; skipping column
NOTE: vec_RAM(MB) = 390.625 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column

| Encoding | recall | latency(ms) | netCPU | avgCpuCount | visited | index(s) | index_docs/s | force_merge(s) | index_size(MB) |
|----------|--------|-------------|--------|-------------|---------|----------|--------------|----------------|----------------|
| float16  | 0.989  | 1.781       | 1.740  | 0.977       | 6826    | 17.07    | 5859.60      | 0.01           | 207.66         |
| float32  | 0.990  | 2.037       | 1.992  | 0.978       | 6865    | 18.84    | 5307.01      | 22.81          | 403.01         |
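The scoring path described above stores FP16 bit patterns and decodes them back to FP32 before the dot product. As a minimal sketch of that conversion path, here is scalar FP16 dot-product scoring using the JDK's `Float.floatToFloat16`/`Float.float16ToFloat` (available since Java 20); the class and method names are hypothetical, not Lucene's:

```java
// Minimal sketch: FP16-encoded vectors scored by decoding each lane to FP32.
public class Fp16DotProduct {

  // Encode a float[] into FP16 bit patterns, as the writer would on flush.
  static short[] toFloat16(float[] v) {
    short[] out = new short[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = Float.floatToFloat16(v[i]); // JDK 20+ API
    }
    return out;
  }

  // Decode each lane back to FP32 and accumulate; this per-lane conversion
  // is the extra search-time cost the benchmarks above are measuring.
  static float dotProduct(short[] a, short[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += Float.float16ToFloat(a[i]) * Float.float16ToFloat(b[i]);
    }
    return sum;
  }

  public static void main(String[] args) {
    short[] a = toFloat16(new float[] {1f, 2f, 3f});
    short[] b = toFloat16(new float[] {4f, 5f, 6f});
    // Small integers are exactly representable in FP16, so this is exact.
    System.out.println(dotProduct(a, b)); // prints 32.0
  }
}
```

A bulk-scoring variant would amortize the decode across many target vectors per call instead of one vector pair at a time.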

@Pulkitg64
Contributor Author

Pulkitg64 commented Feb 4, 2026

These are the results with different quantizations:

With quantization enabled, we are seeing similar performance to float32, which is a bit surprising to me because I expected float16 to be slower: in my code, I have to inflate each fp16 vector to fp32 while quantizing it. I will check and confirm that there is no mistake in the benchmark runs.

| Encoding | recall | latency(ms) | netCPU | avgCpuCount | quantized | visited | index(s) | index_docs/s | force_merge(s) | index_size(MB) | vec_disk(MB) | vec_RAM(MB) |
|----------|--------|-------------|--------|-------------|-----------|---------|----------|--------------|----------------|----------------|--------------|-------------|
| float16  | 0.990  | 1.758       | 1.720  | 0.978       | no        | 6822    | 17.03    | 5873.37      | 0.01           | 207.68         | 390.625      | 390.625     |
| float16  | 0.982  | 2.954       | 2.889  | 0.978       | 8 bits    | 6851    | 19.5     | 5129.26      | 20.74          | 306.87         | 294.495      | 99.182      |
| float16  | 0.927  | 2.07        | 2.025  | 0.978       | 4 bits    | 6934    | 19.75    | 5064.06      | 0.01           | 258.09         | 245.667      | 50.354      |
| float16  | 0.717  | 1.119       | 1.094  | 0.978       | 1 bit     | 8165    | 18.76    | 5331.34      | 0.00           | 222.94         | 208.855      | 13.542      |
| float32  | 0.990  | 2.208       | 2.158  | 0.977       | no        | 6874    | 18.72    | 5341.31      | 22.91          | 403.03         | 390.625      | 390.625     |
| float32  | 0.982  | 2.925       | 2.863  | 0.979       | 8 bits    | 6866    | 19.59    | 5105.95      | 29.53          | 502.2          | 489.807      | 99.182      |
| float32  | 0.927  | 2.112       | 2.065  | 0.978       | 4 bits    | 6968    | 19.17    | 5217.57      | 22.13          | 453.41         | 440.979      | 50.354      |
| float32  | 0.717  | 1.164       | 1.139  | 0.979       | 1 bit     | 8191    | 18.68    | 5353.03      | 18.08          | 418.29         | 404.167      | 13.542      |
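To illustrate the inflation step mentioned above, here is a minimal sketch assuming, as in the comment, that the quantizer only accepts `float[]`. The class, the helper names, and the toy min/max quantization scheme are hypothetical; Lucene's real `ScalarQuantizer` also computes correction terms, omitted here.

```java
// Minimal sketch: inflate stored FP16 vectors to FP32 before 8-bit quantization.
public class InflateThenQuantize {

  // Decode a stored FP16 vector back to FP32 (the extra merge/index-time cost).
  static float[] inflate(short[] fp16) {
    float[] out = new float[fp16.length];
    for (int i = 0; i < fp16.length; i++) {
      out[i] = Float.float16ToFloat(fp16[i]);
    }
    return out;
  }

  // Toy 8-bit scalar quantization over a known [min, max] range.
  static byte[] quantize8(float[] v, float min, float max) {
    byte[] out = new byte[v.length];
    float scale = 255f / (max - min);
    for (int i = 0; i < v.length; i++) {
      int q = Math.round((v[i] - min) * scale);
      out[i] = (byte) Math.min(255, Math.max(0, q)); // clamp to the byte range
    }
    return out;
  }

  public static void main(String[] args) {
    short[] stored = new short[] {Float.floatToFloat16(0f), Float.floatToFloat16(1f)};
    byte[] q = quantize8(inflate(stored), 0f, 1f);
    System.out.println((q[0] & 0xFF) + " " + (q[1] & 0xFF)); // prints 0 255
  }
}
```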

@Pulkitg64
Contributor Author

Since we are seeing good performance with FP16 now, I wanted to know what the path forward should be. The JDK PR for adding Float16Vector operations has not been merged yet, and even if it gets merged, we likely cannot use it until the JDK 27 release (unless we have early access).

So, until we have that support, should we add a native implementation of FP16 scoring? This would be easier after #15508 (which adds native support in Lucene) gets merged. Once JDK 27 is released, we can switch to the Java implementation for scoring.

@github-actions github-actions Bot added this to the 10.5.0 milestone Feb 10, 2026
@Pulkitg64 Pulkitg64 marked this pull request as ready for review February 10, 2026 21:19
@Pulkitg64
Contributor Author

I have removed the Panama implementation for now, but we can add it back later once we have access to Float16Vector operations in the JDK. Below are the benchmark numbers with the default implementation:

Summary: used 100k docs across all runs, with force-merge. For the no-quantization case we see a high latency regression of more than 100% (which is expected), but with quantization latency is comparable. On the indexing side, we see a regression in indexing time across all runs, whether or not quantization is enabled. This is also expected, because for quantization runs we have to do an extra fp16-to-fp32 conversion when quantizing vectors.

| Encoding | recall | latency(ms) | netCPU | avgCpuCount | quantized | visited | index(s) | index_docs/s | force_merge(s) | index_size(MB) | vec_disk(MB) | vec_RAM(MB) |
|----------|--------|-------------|--------|-------------|-----------|---------|----------|--------------|----------------|----------------|--------------|-------------|
| float16  | 0.990  | 5.739       | 5.738  | 1           | no        | 6848    | 42.51    | 2352.66      | 0.00           | 207.68         | 390.625      | 390.625     |
| float16  | 0.982  | 2.681       | 2.680  | 1           | 8 bits    | 6858    | 44.3     | 2257.34      | 20.51          | 306.88         | 294.495      | 99.182      |
| float16  | 0.927  | 1.919       | 1.917  | 0.999       | 4 bits    | 6934    | 44.59    | 2242.55      | 0.01           | 258.09         | 245.667      | 50.354      |
| float16  | 0.835  | 1.525       | 1.524  | 0.999       | 2 bits    | 7277    | 44.68    | 2237.99      | 0.01           | 234.01         | 221.062      | 25.749      |
| float16  | 0.717  | 1.146       | 1.145  | 0.999       | 1 bit     | 8167    | 43.98    | 2273.92      | 0.01           | 222.96         | 208.855      | 13.542      |
| float32  | 0.990  | 2.258       | 2.257  | 0.999       | no        | 6863    | 19.84    | 5039.31      | 20.62          | 403.02         | 390.625      | 390.625     |
| float32  | 0.982  | 2.756       | 2.754  | 1           | 8 bits    | 6867    | 21.29    | 4697.48      | 27.93          | 502.19         | 489.807      | 99.182      |
| float32  | 0.927  | 1.91        | 1.909  | 0.999       | 4 bits    | 6962    | 20.03    | 4992.01      | 22.23          | 453.4          | 440.979      | 50.354      |
| float32  | 0.835  | 1.462       | 1.461  | 0.999       | 2 bits    | 7302    | 20.45    | 4890.93      | 20.92          | 429.31         | 416.374      | 25.749      |
| float32  | 0.717  | 1.174       | 1.173  | 0.999       | 1 bit     | 8205    | 20.16    | 4959.33      | 17.57          | 418.29         | 404.167      | 13.542      |

Next Steps:

If we are okay with the above performance numbers, should we go ahead with this PR, which adds float16 VectorEncoding support without a Panama implementation, or should we park it and wait for the JDK 27 release?
CC: @rmuir @benwtrent

@mikemccand
Member

Thanks @Pulkitg64, this is a very exciting change. It's frustrating to receive fp16 vectors (on Amazon's customer-facing product search team) for indexing and have to fluff them up to fp32 entirely, before then quantizing them down to a more sane 1, 2, 4, or 8 bits per dim. And because these fluffy vectors take 2X the storage they really should, we build ways to drop them from read-only replica indices.

It would be so much better if Lucene could handle incoming vectors entirely as their original fp16 form (this PR).

So, it's JDK 27 that will introduce Panama access to fp16 SIMD capabilities? And modern CPUs generally have good support for fp16? And today (pre-JDK 27) this PR must emulate the fp16 operations in simple Java code, which is why it's slower?

If we enabled users to swap in their own PanamaVectorUtilSupport (#15508 -- whoa, merged!), users could in theory build a gcc-compiled, auto-vectorized implementation, make it accessible through JNA/JNI, and get good performance before JDK 27?

I haven't looked closely at the code changes yet ... just trying to get a grip on the high level situation. Thanks @Pulkitg64.

@msokolov
Contributor

It seems as if this PR has merit as-is, without any special SIMD support from either the JDK or a custom vector support provider implementation: at least when quantizing vectors, which I think will be the default for most applications at this point, performance is on par with fp32 (maybe a little better), we reduce storage requirements, and we also enable ingesting fp16 vectors without casting.

Contributor

@msokolov msokolov left a comment


I got part way through reviewing -- there's a lot of code here! One thing I didn't like is all the if statements we had to add to Lucene104ScalarQuantizedVectorsWriter. I see why it's needed but can't help wishing there was a neater way to organize the branching on float16/float32 implementations. Maybe simply factoring out those conditionals into private utility methods for better readability? Or propagating generics further into the class hierarchy, although I'm a little afraid of where that could lead.

}

@Benchmark
public float shortDotProductScalar() {
Contributor


This treats the short[] as an array of fp16 values, right? Maybe we should rename the method to `fp16DotProductScalar`?

Contributor Author


Makes sense!

public Float16VectorValues getFloat16VectorValues(String field) throws IOException {
FieldInfo info = readState.fieldInfos.fieldInfo(field);
if (info == null) {
// mirror the handling in Lucene90VectorReader#getVectorValues
Contributor


This comment seems out of date since Lucene90VectorReader no longer exists. I'd simply delete the whole comment

Contributor Author


I think I copied this from getFloatVectorValues. I will fix it in the next revision.

}
FieldEntry fieldEntry = fieldEntries.get(info.number);
if (fieldEntry == null) {
// mirror the handling in Lucene90VectorReader#getVectorValues
Contributor


same here

if (fieldData.fieldInfo.getVectorEncoding() == VectorEncoding.FLOAT32) {
corrections =
scalarQuantizer.scalarQuantize(
(float[]) fieldData.getVectors().get(i),
Contributor


hmm, even using generics we still have a cast here -- I wonder if the introduction of generics is worth it

public void mergeOneField(FieldInfo fieldInfo, MergeState mergeState) throws IOException {
if (!fieldInfo.getVectorEncoding().equals(VectorEncoding.FLOAT32)) {
VectorEncoding vectorEncoding = fieldInfo.getVectorEncoding();
if (vectorEncoding != VectorEncoding.FLOAT32 && vectorEncoding != VectorEncoding.FLOAT16) {
Contributor


Maybe we could introduce a method VectorEncoding.isFloatingPoint()?
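A minimal sketch of the suggested helper, on a hypothetical enum mirroring Lucene's `VectorEncoding` (not the real class):

```java
// Hypothetical enum sketching the reviewer's suggested isFloatingPoint() helper.
enum VectorEncoding {
  BYTE,
  FLOAT16,
  FLOAT32;

  // Collapses the repeated "FLOAT32 || FLOAT16" checks into one query,
  // so call sites don't have to enumerate every floating-point encoding.
  boolean isFloatingPoint() {
    return this == FLOAT32 || this == FLOAT16;
  }
}
```

The caller above would then become `if (!vectorEncoding.isFloatingPoint()) { ... }`, and any future floating-point encoding only needs the enum updated in one place.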

}

static class FieldWriter extends FlatFieldVectorsWriter<float[]> {
private abstract static class FieldWriter<T> extends FlatFieldVectorsWriter<T> {
Contributor


ah I see, we already had generics so we're almost forced to use it here now

public List<float[]> getVectors() {
return flatFieldVectorsWriter.getVectors();
@SuppressWarnings("unchecked")
static FieldWriter<?> create(
Contributor


would it help to make this co-vary with FlatFieldVectorsWriter<T> instead of using <?>?

@github-actions
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions Bot added the Stale label Mar 13, 2026