[SPARK-55998][SHS] Synchronize more places on accessing SHS listing.db #54817
Open
pan3793 wants to merge 1 commit into apache:master
pan3793 commented Mar 16, 2026
// If the number of files is bigger than MAX_LOG_NUM,
// clean up all completed attempts per application one by one.
val num = KVUtils.size(listing.view(classOf[LogInfo]).index("lastProcessed"))
NoSuchElementException is thrown from here. Because the LevelDB/RocksDB-based KVStore does not support MVCC, we must use a relatively heavy synchronized block to fix the concurrency issue.
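A minimal sketch of the failure mode and the fix described above. ToyStore is a hypothetical stand-in for a store without snapshot isolation; it is not Spark's KVStore API, and the names ListingSyncSketch, countSafely, and deleteSafely are assumptions made for illustration. The first method interleaves a delete with an in-flight scan to reproduce the exception deterministically; the second guards both the scan and the delete with the same monitor, in the spirit of listing.synchronized { ... }.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical toy store: a view captures the key list up front but resolves
// each value lazily, so it has no snapshot isolation (no MVCC).
final class ToyStore {
  private val data = TrieMap.empty[String, Long]

  def write(path: String, lastProcessed: Long): Unit = data.put(path, lastProcessed)
  def delete(path: String): Unit = data.remove(path)

  def view(): Iterator[(String, Long)] = {
    val keys = data.keys.toList
    keys.iterator.map { k =>
      // A key deleted after the key list was captured can no longer be resolved.
      k -> data.getOrElse(k, throw new NoSuchElementException(k))
    }
  }
}

object ListingSyncSketch {
  // The iterator is fully consumed inside the lock, so no delete can interleave.
  def countSafely(store: ToyStore): Int = store.synchronized { store.view().size }

  def deleteSafely(store: ToyStore, path: String): Unit =
    store.synchronized { store.delete(path) }

  // Simulates a delete that lands between creating a view and consuming it.
  def unsynchronizedScanFails(): Boolean = {
    val store = new ToyStore
    (1 to 3).foreach(i => store.write(s"log-$i", i.toLong))
    val it = store.view()
    store.delete("log-2")
    try { it.toList; false } catch { case _: NoSuchElementException => true }
  }

  // Runs scans while another thread deletes; both sides take the same lock.
  def synchronizedScanSucceeds(): Boolean = {
    val store = new ToyStore
    (1 to 1000).foreach(i => store.write(s"log-$i", i.toLong))
    val deleter = new Thread(() => (1 to 1000).foreach(i => deleteSafely(store, s"log-$i")))
    deleter.start()
    var ok = true
    while (deleter.isAlive) {
      try { countSafely(store); () } catch { case _: NoSuchElementException => ok = false }
    }
    deleter.join()
    ok && countSafely(store) == 0
  }

  def main(args: Array[String]): Unit = {
    println(s"unsynchronized scan failed: ${unsynchronizedScanFails()}")
    println(s"synchronized scan succeeded: ${synchronizedScanSucceeds()}")
  }
}
```

The lock is coarse: every scan blocks every delete for its full duration, which is the "relatively heavy" cost mentioned in the comment. The sketch only protects callers that go through the guarded helpers, so every access path has to take the same lock.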
@sarutak @LuciferYang @dongjoon-hyun could you please take a look?
dongjoon-hyun approved these changes Mar 16, 2026
What changes were proposed in this pull request?
Wrap all listing.delete(classOf[LogInfo], <path>) and listing.view(classOf[LogInfo]) operations in SHS's FsHistoryProvider with listing.synchronized { ... }. This is similar to SPARK-37659.
Why are the changes needed?
With the above configs, we found that the actual number of preserved event logs can sometimes exceed 1 million, which reaches the HDFS limit on the number of files in a single directory and then prevents new Spark apps from being submitted (they fail because the event log file cannot be created).
After digging through the SHS logs (from an internal version based on OSS Spark 3.3.4; after taking a look, the master code appears to have the same issue), we found that in most cases the scheduled cleanLogs task fails with
and the failure causes a significant backlog in the event log folders.
Does this PR introduce any user-facing change?
Yes. This fixes a bug where SHS may fail to clean up expired event logs in time.
How was this patch tested?
I rolled out the patched SHS online a few days ago and have been monitoring the logs; the cleanup phase no longer fails.
Was this patch authored or co-authored using generative AI tooling?
No.