GROOVY-10307: Improve invokedynamic performance with optimized caching#2374
GROOVY-10307: Improve invokedynamic performance with optimized caching#2374jamesfredley wants to merge 5 commits intoapache:GROOVY_4_0_Xfrom
Conversation
This commit significantly improves invokedynamic performance in Grails 7 / Groovy 4 by implementing several key optimizations: 1. **Disabled Global SwitchPoint Guard**: Removed the global SwitchPoint that caused ALL call sites to invalidate when ANY metaclass changed. In Grails and similar frameworks with frequent metaclass modifications, this was causing massive guard failures and performance degradation. Individual call sites now manage their own invalidation via cache clearing. Can be re-enabled with: -Dgroovy.indy.switchpoint.guard=true 2. **Monomorphic Fast-Path**: Added volatile fields (latestClassName/ latestMethodHandleWrapper) to CacheableCallSite to skip synchronization and map lookups for call sites that consistently see the same types, which is the common case in practice. 3. **Call Site Cache Invalidation**: Added call site registration and cache clearing mechanism. When metaclass changes, all registered call sites have their caches cleared and targets reset to the default (fromCache) path, ensuring metaclass changes are visible. 4. **Optimized Cache Operations**: Added get() and putIfAbsent() methods to CacheableCallSite for more efficient cache access patterns, and clearCache() for metaclass change invalidation. 5. **Pre-Guard Method Handle Storage**: Selector now stores the method handle before argument type guards are applied (handleBeforeArgGuards), enabling potential future optimizations for direct invocation. The thresholds can be tuned via system properties: - groovy.indy.optimize.threshold (default: 10000) - groovy.indy.fallback.threshold (default: 10000) - groovy.indy.callsite.cache.size (default: 4) - groovy.indy.switchpoint.guard (default: false)
|
I do not believe the |
Remove monomorphic fast-path, optimized cache ops, and pre-guard method handle storage. Testing showed these provide no measurable benefit in Grails 7 benchmarks while adding complexity. The minimal optimization (disabled SwitchPoint guard + cache invalidation) achieves the same performance improvement with 75% less code (69 lines vs 277 lines). Removed: - Monomorphic fast-path (latestClassName/latestMethodHandleWrapper fields) - Optimized cache operations (get/putIfAbsent methods) - Pre-guard method handle storage (handleBeforeArgGuards, directMethodHandle) Retained: - Disabled global SwitchPoint guard by default - Cache invalidation on metaclass changes - Call site registration for invalidation tracking
Optimizations Retained
Optimizations Removed (No Measurable Benefit)
Testing showed these additional optimizations provide no measurable performance benefit in the Grails 7 benchmark while adding complexity. |
| // sufficient, combined with cache invalidation on metaclass changes. | ||
| // | ||
| // If you need strict metaclass change detection, set groovy.indy.switchpoint.guard=true | ||
| if (SystemUtil.getBooleanSafe("groovy.indy.switchpoint.guard")) { |
There was a problem hiding this comment.
The one question I have about this is whether this check ought to be inverted. That is, should the default behaviour be the old way, which we know is less performant but more battle-tested, or this way? Could Grails simply turn this flag on by default, thus solving their problem, without impacting other users?
Also, as a nitpicky thing, I reckon this string ought to be a public static final field so that we could add some tests around it.
There was a problem hiding this comment.
If this PRs approach is viable, but does not prove to be generally valuable for all/most code run with Indy on, than I think the inverse would be fine for Grails and we could set that flag in default generated applications.
There was a problem hiding this comment.
This is part of why I'd like to run this branch against the current benchmarks on CI. That would likely provide some confidence as to whether this was "safe" to merge without impacting anyone's expectations about how the language behaves.
|
I'm also well outside my expertise here, but so long as we're doing this, I'd be curious to see how this runs against the performance tests we have, compared to baseline. I tried running That seems comparable, so my assumption is this hasn't hurt the most basic performance measures we do. Main reason I'd want CI confirmation is to make sure it wasn't a quirk of my machine. |
|
Based on GROOVY-10307 and GROOVY-11842, I created another test project: https://github.com/jamesfredley/groovy-indy-performance |
|
Test Date: January 30, 2026 Master Comparison TableAll values are averages across 3 runs. Times in milliseconds unless noted. Bold indicates best performance. Loop Benchmarks (ComprehensiveBenchmark)
Method Invocation Benchmarks (ComprehensiveBenchmark)
Closure Benchmarks (ComprehensiveBenchmark)
Other Benchmarks (ComprehensiveBenchmark)
Method Invocation Benchmarks (MethodInvocationBenchmark)
Closure Benchmarks (ClosureBenchmark)
Loop Benchmarks (LoopsBenchmark)
Call Site Invalidation (Critical for GROOVY-10307)
Summary Table: Key Metrics
|
|
|
||
| // Invalidate all call site caches and reset targets to default (cache lookup) | ||
| // This ensures metaclass changes are visible without using expensive switchpoint guards | ||
| ALL_CALL_SITES.removeIf(ref -> { |
There was a problem hiding this comment.
The size of ALL_CALL_SITES could be very big, so it's better to use parallel stream for better performance.
There was a problem hiding this comment.
When metaclass changes, all registered call sites have their caches cleared
Current implementation will clear all cache no matter whether the changed metaclasses are releated.
Groovy 5 has lookup instance in the CacheableCallSite, it can help us determine which class is using the cache site, maybe we can backport it to Groovy 4:
There was a problem hiding this comment.
I tested two versions of the parallel stream change with the two test application and saw a reduction in performance.
| Metric | Baseline | Approach 1 | Approach 2 |
|---|---|---|---|
| Grails App (steady state) | ~141 ms | ~167 ms (18% SLOWER) | ~158 ms (12% SLOWER) |
Approach 1: Parallel Collect + Atomic Swap - 18% SLOWER
// Invalidate all call site caches and reset targets to default (cache lookup)
// This ensures metaclass changes are visible without using expensive switchpoint guards
Set<WeakReference<CacheableCallSite>> liveReferences = ALL_CALL_SITES.parallelStream()
.filter(ref -> {
CacheableCallSite cs = ref.get();
if (cs == null) {
return false; // Don't keep garbage collected references
}
// Reset target to default (fromCache) so next call goes through cache lookup
MethodHandle defaultTarget = cs.getDefaultTarget();
if (defaultTarget != null && cs.getTarget() != defaultTarget) {
cs.setTarget(defaultTarget);
}
// Clear the cache so stale method handles are discarded
cs.clearCache();
return true; // Keep live references
})
.collect(Collectors.toSet());
// Atomic swap - clear and replace with live references only
ALL_CALL_SITES.clear();
ALL_CALL_SITES.addAll(liveReferences);
Approach 2: Parallel ForEach + Sequential RemoveIf - 12% SLOWER
// Invalidate all call site caches in parallel
ALL_CALL_SITES.parallelStream().forEach(ref -> {
CacheableCallSite cs = ref.get();
if (cs != null) {
// Reset target to default (fromCache) so next call goes through cache lookup
MethodHandle defaultTarget = cs.getDefaultTarget();
if (defaultTarget != null && cs.getTarget() != defaultTarget) {
cs.setTarget(defaultTarget);
}
// Clear the cache so stale method handles are discarded
cs.clearCache();
}
});
// Remove garbage collected references (must be sequential for ConcurrentHashMap.KeySetView)
ALL_CALL_SITES.removeIf(ref -> ref.get() == null);
There was a problem hiding this comment.
Testing with the Groovy 5 caller-specific Lookup pattern is generally positive on the micro benchmarks but 11% slower in the grails test application:
More details: jamesfredley#1
src/main/java/org/codehaus/groovy/vmplugin/v8/IndyInterface.java
Outdated
Show resolved
Hide resolved
70ff21f to
860f460
Compare
860f460 to
ca36b8c
Compare
There was a problem hiding this comment.
Pull request overview
This PR improves invokedynamic performance in Groovy by addressing performance degradation caused by the global SwitchPoint guard. When any metaclass changes, the global SwitchPoint invalidates ALL call sites, causing severe performance issues in frameworks like Grails that frequently modify metaclasses.
Changes:
- Disabled the global SwitchPoint guard by default (can be re-enabled via system property)
- Implemented call site registration using WeakReferences for targeted cache invalidation
- Added cache clearing mechanism that resets call sites to default targets when metaclasses change
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Selector.java | Added system property to control SwitchPoint guard, disabled by default to avoid global invalidation overhead |
| IndyInterface.java | Implemented WeakReference-based call site registry and cache invalidation logic to handle metaclass changes without global SwitchPoint |
| CacheableCallSite.java | Added clearCache() method to discard stale method handles when metaclass changes occur |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/main/java/org/codehaus/groovy/vmplugin/v8/CacheableCallSite.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/main/java/org/codehaus/groovy/vmplugin/v8/CacheableCallSite.java
Outdated
Show resolved
Hide resolved
6c6593e to
868cfc5
Compare
- Document best-effort invalidation strategy in invalidateSwitchPoints(): concurrent threads may briefly reinstall optimized targets, but method handle guards (SAME_MC, SAME_CLASS) ensure correctness by forcing fallback - Keep clearCache() volatile field access outside synchronized block for consistency with getAndPut() which also accesses it outside the lock; volatile provides sufficient visibility guarantees for this field - Add comprehensive Javadoc for INDY_SWITCHPOINT_GUARD explaining performance implications of enabling the global switchpoint guard - Document WeakReference cleanup behavior in ALL_CALL_SITES and registerCallSite() to clarify garbage collection semantics - Simplify inline comment in setGuards() with detailed Javadoc on field
868cfc5 to
8c05bd5
Compare
paulk-asert
left a comment
There was a problem hiding this comment.
I am generally in favor of this change but not for GROOVY_4_0_X in the first instance. Safest would be master only but I understand folks are still by-and-large porting to GROOVY 5 rather than looking at 6, so we might need to get it in 5 for further feedback. I still have to think further about scenarios for when it is safe vs unsafe for the proposed default settings. I am not averse to potentially eventually porting to 4 but there are lots of folks on 4 and I am unsure we understand enough which ones might be adversely affected.
|
PR against master(Groovy 6): #2377 |
This PR significantly improves invokedynamic performance in Grails 7 / Groovy 4 by implementing several key optimizations:
https://issues.apache.org/jira/browse/GROOVY-10307
apache/grails-core#15293
Using https://github.com/jamesfredley/grails7-performance-regression as a test application
10K/10K = groovy.indy.optimize.threshold/groovy.indy.fallback.threshold
Key Results
Problem Statement
Root Cause Analysis
The performance in the Grails 7 Test Application was traced to the global SwitchPoint guard in Groovy's invokedynamic implementation:
Optimizations Applied
Disabled Global SwitchPoint Guard
-Dgroovy.indy.switchpoint.guard=trueCall Site Cache Invalidation
ALL_CALL_SITESset with weak references)Optimizations Removed (No Measurable Benefit)
Testing showed these additional optimizations provide no measurable performance benefit while adding complexity:
Monomorphic Fast-Path(REMOVED)Added volatile fields (latestClassName/latestMethodHandleWrapper) toCacheableCallSiteSkips synchronization and map lookups for call sites that consistently see the same typesOptimized Cache Operations(REMOVED)Addedget()method for cache lookup without automatic putAddedputIfAbsent()for thread-safe cache populationgetAndPut()sufficientPre-Guard Method Handle Storage(REMOVED)Selector stores the method handle before argument type guards are applied (handleBeforeArgGuards)Enables potential future optimizations for direct invocationWhat Was NOT Changed
Performance Results
Test Environment
Runtime Performance Comparison (4GB Heap - Recommended)
Startup Time Comparison
Visual Performance Summary (4GB Heap)
Configuration Test Matrix
All tests performed on January 30, 2026 with JDK 25.0.2 (Corretto), 4GB heap (-Xmx4g -Xms2g). Runtime values are stabilized averages (runs 3-5).
Failed Optimization Attempts
These approaches were tried but either caused test failures, provided insufficient benefit (<6%), or made performance worse. They are documented here to prevent repeating the same mistakes.
UNSAFE - Caused Test Failures
1. Skip Metaclass Guards for Cache Hits
What it tried: Store method handle before guards, use unguarded handle for cache hits.
Why it failed: Call sites can be polymorphic (called with multiple receiver types). The unguarded handles contain
explicitCastArgumentswith embedded type casts that throwClassCastExceptionwhen different types pass through.2. Version-Based Metaclass Guards
What it tried: Use metaclass version instead of identity for guards, skip guards for "cache hits."
Why it failed: Same polymorphic call site issue - skipping guards for cache hits is fundamentally unsafe.
3. Skip Argument Guards After Warmup
What it tried: After warmup period, use handles without argument type guards.
Why it failed: Call sites can receive different argument types over their lifetime. Skipping argument guards causes
ClassCastExceptionorIncompatibleClassChangeError.4. Per-Class Metaclass Invalidation
What it tried: Only invalidate call sites for the specific class whose metaclass changed, not all call sites globally.
Why it failed: Broke mock framework tests (
MockFor,StrictExpectation). The mock framework works by temporarily replacing metaclass and expects ALL call sites to see the change immediately - not just call sites for that class.5. Monomorphic Fast-Path with Identity Check
What it tried: Store latest receiver class name and method handle for quick identity check before hash lookup.
Why it failed: UNSAFE with per-instance metaclasses. Causes NPE when
NULL_METHOD_HANDLE_WRAPPERgets cached for stale handles.INEFFECTIVE - Less Than 6% Improvement
6. Increase Cache Size (4 to 16)
What it tried: Increase the per-call-site LRU cache from 4 to 16 entries to improve polymorphic call site performance.
Result: Less than 6% improvement in Grails benchmark. The polymorphic benchmark uses way more than 16 types, so cache size doesn't help.
7. Per-Class Generation Tokens (JRuby Pattern)
What it tried: Add
classInfoVersionfield to cache entries. On each call, compare cache token with receiver class generation for cheap integer validation instead of hash lookup.Result: Less than 6% improvement. Category and mock tests still require global cache clearing anyway, so generation tokens provide little benefit.
8. Optimized Cache Operations (get/putIfAbsent methods)
What it tried: Add separate
get()andputIfAbsent()methods instead of combinedgetAndPut().Result: No measurable benefit. The overhead is elsewhere, not in cache operations.
9. Pre-Guard Method Handle Storage
What it tried: Store method handle before argument type guards are applied (
handleBeforeArgGuards,directMethodHandle) for potential future optimizations.Result: No measurable benefit. Added unused code complexity.
HARMFUL - Made Performance Worse
10. Bimorphic Inline Cache
What it tried: Add a 2-slot inline cache with identity comparison before HashMap lookup. Fast path for call sites that see 1-2 types.
Results:
Why it failed: Adds overhead on every call site access. While it helps highly polymorphic sites, it hurts monomorphic-dominant workloads like typical Grails apps.
11. Parallel Stream for Cache Invalidation (Approach 1)
What it tried: Use
parallelStream().filter().collect()with atomic swap for call site invalidation.Result: Made Grails app 18% SLOWER (167ms vs 141ms baseline).
12. Parallel Stream for Cache Invalidation (Approach 2)
What it tried: Use
parallelStream().forEach()for invalidation +removeIf()for cleanup.Result: Made Grails app 12% SLOWER (158ms vs 141ms baseline).
Why parallel streams failed: Thread coordination overhead outweighs parallelization benefit for typical call site counts (~1000-5000 sites). Sequential
removeIfhas better cache locality and no synchronization overhead.Key Learnings
Any optimization that bypasses guards based on cache state is unsafe because:
explicitCastArgumentswill throw if types don't matchSafe optimizations must:
Performance tradeoffs:
Remaining Performance Gap
The optimized indy is still ~2.3x slower than non-indy (139ms vs 60ms). The remaining gap comes from:
Potential Future Optimization Areas
Configuration Reference
System Properties
groovy.indy.optimize.thresholdgroovy.indy.fallback.thresholdgroovy.indy.callsite.cache.sizegroovy.indy.switchpoint.guardFiles Modified
Source Files
Selector.javaIndyInterface.javaCacheableCallSite.javaclearCache()method