[Feature] Add minimal AlmaLinux support (#53)#439

Open
gigabyte132 wants to merge 1 commit into ROCm:main from gigabyte132:add-almalinux-support

Conversation


@gigabyte132 gigabyte132 commented Feb 19, 2026

Motivation

This PR adds minimal AlmaLinux support, addressing errors like the one described here (#53):

err: OS: almalinux 8.10 (cerulean leopard) not supported. Should be one of [red hat redhat ubuntu coreos rhel]

Technical Details

This PR adds an almaCMNameMapper method, allowing the GPU operator to run on AlmaLinux nodes.

This PR does not add full support for AlmaLinux (i.e. it does not provide the template dockerfiles); it covers the two cases where we either use the inbox AMD drivers or have external builds of the driver image.
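
For illustration, a minimal sketch of what an OS-name mapper like the one this PR adds might look like. The actual `almaCMNameMapper` signature in the operator, and the `el<major>` return format, are assumptions here; the sketch only shows the idea of mapping the AlmaLinux OS image string reported in the node status onto a RHEL-style name the operator already understands:

```go
package main

import (
	"fmt"
	"strings"
)

// almaCMNameMapper maps an AlmaLinux OS image string (as reported in the
// node status, e.g. "almalinux 9.5") onto a RHEL-style short name.
// Hypothetical sketch, not the operator's actual implementation.
func almaCMNameMapper(osImage string) (string, error) {
	fields := strings.Fields(strings.ToLower(osImage))
	if len(fields) < 2 || fields[0] != "almalinux" {
		return "", fmt.Errorf("not an AlmaLinux image: %q", osImage)
	}
	// Keep only the major version: "9.5" -> "9".
	major := strings.SplitN(fields[1], ".", 2)[0]
	return "el" + major, nil
}

func main() {
	name, err := almaCMNameMapper("almalinux 9.5")
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // el9
}
```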

Test Plan

We tested this successfully on AlmaLinux 9.5 and AlmaLinux 10.1, both with an external driver container build (30.20) and with the inbox drivers. Please let me know if you would like me to add additional tests.

Test Result

The GPU operator exposes the AMD GPUs as expected on nodes running AlmaLinux 9.5 as well as AlmaLinux 10.1.

Submission Checklist

Signed-off-by: Raulian-Ionut Chiorescu <raulian-ionut.chiorescu@cern.ch>
@gigabyte132 gigabyte132 force-pushed the add-almalinux-support branch from 3193edc to b82e56a on February 19, 2026 10:54
@yansun1996
Member

Hi @gigabyte132, thanks for raising this PR. Would you mind:

  1. Continuing to check whether the metrics exporter and the device plugin (with some workload) are working fine? We provide workload YAMLs in the example folder that can be quickly tested manually. We need to confirm that this minimum set works on your side, since ROCm hasn't officially claimed support for AlmaLinux yet.

  2. Sharing the specific use case and cluster size on your side?

Then we would evaluate your PR.

@gigabyte132
Author

Hi @yansun1996 , apologies for the delay.

The workloads in the example folder all execute successfully except the alexnet training, which fails due to an incompatibility with the latest version of Keras and therefore seems unrelated to this PR:

Traceback (most recent call last):
  File "/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 25, in <module>
    import benchmark_cnn
  File "/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 44, in <module>
    from models import model_config
  File "/benchmarks/scripts/tf_cnn_benchmarks/models/model_config.py", line 31, in <module>
    from models.experimental import deepspeech
  File "/benchmarks/scripts/tf_cnn_benchmarks/models/experimental/deepspeech.py", line 121, in <module>
    class DeepSpeech2Model(model_lib.Model):
  File "/benchmarks/scripts/tf_cnn_benchmarks/models/experimental/deepspeech.py", line 126, in DeepSpeech2Model
    'lstm': tf.nn.rnn_cell.BasicLSTMCell,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tensorflow/python/util/lazy_loader.py", line 207, in __getattr__
    raise AttributeError(
AttributeError: `BasicLSTMCell` is not available with Keras 3.

Both the metrics exporter and the device plugin appear to function normally.

  2. As for the specific use case and cluster size:

We have one cluster where we have complete control over the underlying OS and all other components; this is only 8 AMD nodes: two nodes with 8x MI300X each and six nodes with 4x W7900 Radeon Pro each. Within our organization we have also received a pledge of resources from a few other teams that we will use as bursting capacity. For those nodes we don't control the underlying OS, so they feel more like a "managed" k8s cluster. We are starting a PoC with 48 GPUs and, if successful, it will grow to the order of a couple of thousand. The issue is that these nodes come with either alma9 or alma10 as the underlying OS, which is why we need AlmaLinux support in the gpu-operator.

Let me know if you need anything else from me, happy to help/contribute.

bhatnitish added a commit to bhatnitish/rocm-gpu-operator that referenced this pull request Mar 3, 2026
* [RELEASE] Update test runner and utils container default image URL (ROCm#435)

* Update utils container default image URL

* Update test runner container default image URL

* Update default image for exporter and beta test runner (ROCm#436)

* [RELEASE] Update default images in helm charts (ROCm#438)

* enable health pulse for kmm from operator (ROCm#439)

* Keep some image using internal registry URL for internal repo

* Update default device plugin image to latest tag (ROCm#442)

* Update default device plugin image to latest tag

* Fix unit test

---------

Co-authored-by: Praveen Kumar Shanmugam <58961022+spraveenio@users.noreply.github.com>
Co-authored-by: Nitish Bhat <bhatnitish@gmail.com>
@yansun1996
Member

yansun1996 commented Mar 5, 2026

Hi @gigabyte132 , thanks for the reply !

I just checked and retried with your PR. It looks like if you set spec.driver.enable=false, the operator by default bypasses the OS-name check against the node status, and bringing up the other components then works.

So, without this PR, did you hit any error message even with spec.driver.enable=true configured?

That would help determine whether you can continue without this PR or not.
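
For reference, a sketch of the DeviceConfig spec being discussed. The apiVersion, namespace, and metadata values below are illustrative placeholders; only the spec.driver.enable toggle is the field under discussion:

```yaml
# Hypothetical DeviceConfig sketch for the inbox-driver case.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator            # placeholder name
  namespace: kube-amd-gpu       # placeholder namespace
spec:
  driver:
    # false: skip the out-of-tree driver install and use the inbox
    # amdgpu driver; per the discussion above, this path bypasses
    # the OS-name check on the node status.
    enable: false
```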

@gigabyte132
Copy link
Author


Hello @yansun1996

In the current configuration we already had spec.driver.enable=true, and we ran into the error message described in #53 (err: OS: almalinux 8.10 (cerulean leopard) not supported. Should be one of [red hat redhat ubuntu coreos rhel]).
