[Feature] Add minimal AlmaLinux support (#53)#439

Open
gigabyte132 wants to merge 1 commit into ROCm:main from gigabyte132:add-almalinux-support

Conversation


@gigabyte132 gigabyte132 commented Feb 19, 2026

Motivation

This PR adds minimal AlmaLinux support, addressing errors like the one described here (#53):

err: OS: almalinux 8.10 (cerulean leopard) not supported. Should be one of [red hat redhat ubuntu coreos rhel]

Technical Details

This PR adds an almaCMNameMapper method, allowing the GPU operator to run on AlmaLinux nodes.

This PR does not add full support for AlmaLinux (i.e. it does not provide the template dockerfiles); it covers the two cases where we either use the inbox AMD drivers or have external builds of the driver image.
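
For illustration, a minimal sketch of what an OS-name mapper like the one this PR adds might look like. The actual `almaCMNameMapper` signature in the operator, and the `el<major>` return format, are assumptions here; the sketch only shows the idea of mapping the AlmaLinux OS image string reported in the node status onto a RHEL-style name the operator already understands:

```go
package main

import (
	"fmt"
	"strings"
)

// almaCMNameMapper maps an AlmaLinux OS image string (as reported in the
// node status, e.g. "almalinux 9.5") onto a RHEL-style short name.
// Hypothetical sketch, not the operator's actual implementation.
func almaCMNameMapper(osImage string) (string, error) {
	fields := strings.Fields(strings.ToLower(osImage))
	if len(fields) < 2 || fields[0] != "almalinux" {
		return "", fmt.Errorf("not an AlmaLinux image: %q", osImage)
	}
	// Keep only the major version: "9.5" -> "9".
	major := strings.SplitN(fields[1], ".", 2)[0]
	return "el" + major, nil
}

func main() {
	name, err := almaCMNameMapper("almalinux 9.5")
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // el9
}
```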

Test Plan

We tested this successfully on AlmaLinux 9.5 and AlmaLinux 10.1, both with an external driver container build (30.20) and with the inbox drivers. Please let me know if you would like me to add additional tests.

Test Result

The GPU operator exposes the AMD GPUs as expected on nodes running AlmaLinux 9.5 as well as AlmaLinux 10.1.

Submission Checklist

Signed-off-by: Raulian-Ionut Chiorescu <raulian-ionut.chiorescu@cern.ch>
@gigabyte132 gigabyte132 force-pushed the add-almalinux-support branch from 3193edc to b82e56a on February 19, 2026 10:54
@yansun1996
Member

Hi @gigabyte132, thanks for raising this PR. Would you mind:

  1. Continuing to check whether the metrics exporter and the device plugin (with some workload) are working fine? We provide workload YAMLs in the example folder that can be quickly tested manually. We need to confirm that this minimum set works on your side, since ROCm hasn't officially claimed support for AlmaLinux yet.

  2. Sharing the specific use case and cluster size on your side?

Then we would evaluate your PR.

@gigabyte132
Author

Hi @yansun1996 , apologies for the delay.

The workloads in the example folder all execute successfully except the alexnet training, which fails due to an incompatibility with the latest version of Keras and therefore seems unrelated to this PR:

Traceback (most recent call last):
  File "/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 25, in <module>
    import benchmark_cnn
  File "/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 44, in <module>
    from models import model_config
  File "/benchmarks/scripts/tf_cnn_benchmarks/models/model_config.py", line 31, in <module>
    from models.experimental import deepspeech
  File "/benchmarks/scripts/tf_cnn_benchmarks/models/experimental/deepspeech.py", line 121, in <module>
    class DeepSpeech2Model(model_lib.Model):
  File "/benchmarks/scripts/tf_cnn_benchmarks/models/experimental/deepspeech.py", line 126, in DeepSpeech2Model
    'lstm': tf.nn.rnn_cell.BasicLSTMCell,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tensorflow/python/util/lazy_loader.py", line 207, in __getattr__
    raise AttributeError(
AttributeError: `BasicLSTMCell` is not available with Keras 3.

Both the metrics exporter and the device plugin appear to function normally.

  2. As for the specific use case and cluster size:

We have one cluster where we have complete control over the underlying OS and all other components; this is only 8 AMD nodes: two nodes with 8x MI300X each and six nodes with 4x W7900 Radeon Pro each. Within our organization we have also received a pledge of resources from a few other teams that we will use as bursting capacity. For those nodes we don't control the underlying OS, so they feel more like a "managed" k8s cluster. We are starting a PoC with 48 GPUs and, if successful, it will grow to the order of a couple of thousand. The issue is that these nodes come with either alma9 or alma10 as the underlying OS, which is why we need AlmaLinux support in the gpu-operator.

Let me know if you need anything else from me, happy to help/contribute.

bhatnitish added a commit to bhatnitish/rocm-gpu-operator that referenced this pull request Mar 3, 2026
* [RELEASE] Update test runner and utils container default image URL (ROCm#435)

* Update utils container default image URL

* Update test runner container default image URL

* Update default image for exporter and beta test runner (ROCm#436)

* [RELEASE] Update default images in helm charts (ROCm#438)

* enable health pulse for kmm from operator (ROCm#439)

* Keep some image using internal registry URL for internal repo

* Update default device plugin image to latest tag (ROCm#442)

* Update default device plugin image to latest tag

* Fix unit test

---------

Co-authored-by: Praveen Kumar Shanmugam <58961022+spraveenio@users.noreply.github.com>
Co-authored-by: Nitish Bhat <bhatnitish@gmail.com>
@yansun1996
Member

yansun1996 commented Mar 5, 2026

Hi @gigabyte132 , thanks for the reply !

I just checked and retried with your PR. It looks like if you set spec.driver.enable=false, the operator by default bypasses the OS-name check against the node status, and bringing up the other components then works.

So, without this PR, did you hit any error message even with spec.driver.enable=true configured?

That would help determine whether you can continue without this PR or not.
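
For reference, a sketch of the DeviceConfig spec being discussed. The apiVersion, namespace, and metadata values below are illustrative placeholders; only the spec.driver.enable toggle is the field under discussion:

```yaml
# Hypothetical DeviceConfig sketch for the inbox-driver case.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator            # placeholder name
  namespace: kube-amd-gpu       # placeholder namespace
spec:
  driver:
    # false: skip the out-of-tree driver install and use the inbox
    # amdgpu driver; per the discussion above, this path bypasses
    # the OS-name check on the node status.
    enable: false
```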

@gigabyte132
Copy link
Author


Hello @yansun1996

In the current configuration we already had spec.driver.enable=true, and we ran into the error message described in #53 (err: OS: almalinux 8.10 (cerulean leopard) not supported. Should be one of [red hat redhat ubuntu coreos rhel]).
