[Feature] Add minimal AlmaLinux support (#53) #439
gigabyte132 wants to merge 1 commit into ROCm:main from
Conversation
Signed-off-by: Raulian-Ionut Chiorescu <raulian-ionut.chiorescu@cern.ch>
3193edc to b82e56a
Hi @gigabyte132, thanks for raising this PR. Would you mind:
Then we would evaluate your PR.
Hi @yansun1996, apologies for the delay. The workloads in the example folder all execute successfully except the AlexNet training, which appears to fail due to an incompatibility with the latest version of Keras, so that seems unrelated. Both the metrics exporter and the device plugin function normally.
We have a cluster over which we have complete control of the underlying OS and all other components; this would be only 8 AMD nodes, a couple of 2x8 MI300X and 6x4 W7900 Radeon Pro nodes. Within our organization we have received a pledge of resources from a few other teams that we will use as bursting capacity. For these nodes we don't have control over the underlying OS, so it feels more like a "managed" k8s cluster. We are starting a PoC with 48 GPUs, and if successful it will grow to the order of a couple thousand. The issue is that these nodes come with either Alma 9 or Alma 10 as the underlying OS, which is why we need AlmaLinux support in the gpu-operator. Let me know if you need anything else from me; happy to help/contribute.
Hi @gigabyte132, thanks for the reply! I just checked and retried with your PR. It looks like if you specify … So without this PR, did you hit any error message even if you configured …? That would help determine whether you can continue without this PR or not.
Hello @yansun1996, in the current configuration we already had …
Motivation
This PR adds minimal AlmaLinux support, addressing errors like the one described in #53:
```
err: OS: almalinux 8.10 (cerulean leopard) not supported. Should be one of [red hat redhat ubuntu coreos rhel]
```
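For context, the check fails because AlmaLinux reports its own distro ID in `/etc/os-release`, which is absent from the operator's supported list. A minimal shell sketch of how such an ID is typically read (the `ID`/`VERSION_ID` values are what AlmaLinux 8.10 ships; the temp-file path is only for illustration):

```shell
# Simulate the relevant fields of /etc/os-release on an AlmaLinux 8.10 node.
cat > /tmp/os-release <<'EOF'
ID="almalinux"
VERSION_ID="8.10"
EOF

# Source it the way a detection script might, then inspect the distro ID.
. /tmp/os-release
echo "$ID $VERSION_ID"   # almalinux 8.10
```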
Technical Details
This PR adds an `almaCMNameMapper` method, allowing the GPU operator to run on AlmaLinux nodes. This PR does not add full support for AlmaLinux (i.e. it does not provide the template Dockerfiles); it covers the two cases where we use the inbox AMD drivers or have external builds of the driver image.
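To illustrate the idea, here is a hedged sketch of the kind of mapping an `almaCMNameMapper`-style method performs: normalize the distro ID reported by the node and return the key used to look up driver resources. The function name, signature, and return values below are hypothetical and not taken from the actual gpu-operator source; the supported-ID list mirrors the error message above, with `almalinux` added and mapped to the RHEL scheme since AlmaLinux is RHEL-compatible.

```go
package main

import (
	"fmt"
	"strings"
)

// mapOSToConfigMapName maps a distro ID and version from /etc/os-release to
// a lookup key. Hypothetical sketch; not the operator's real API.
func mapOSToConfigMapName(osID, versionID string) (string, error) {
	id := strings.ToLower(strings.TrimSpace(osID))
	switch id {
	case "rhel", "redhat", "red hat":
		return "rhel-" + versionID, nil
	case "ubuntu":
		return "ubuntu-" + versionID, nil
	case "coreos":
		return "coreos-" + versionID, nil
	case "almalinux":
		// AlmaLinux is RHEL-compatible, so reuse the RHEL naming
		// scheme keyed on the major version only.
		major := strings.SplitN(versionID, ".", 2)[0]
		return "rhel-" + major, nil
	}
	return "", fmt.Errorf("OS: %s %s not supported", osID, versionID)
}

func main() {
	name, err := mapOSToConfigMapName("almalinux", "9.5")
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // rhel-9
}
```

With inbox drivers or an externally built driver image, a mapping like this is all the operator needs to proceed on AlmaLinux nodes, since no distro-specific build template is required.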
Test Plan
We have tested this successfully on AlmaLinux 9.5 as well as AlmaLinux 10.1, both with an external driver container build (30.20) and with the inbox drivers. Please let me know if you would like me to add additional tests.
Test Result
The GPU operator exposes the AMD GPUs as expected on nodes running AlmaLinux 9.5 as well as AlmaLinux 10.1.
Submission Checklist