
feat: BundleCE for job grouping#8476

Draft
AcquaDiGiorgio wants to merge 46 commits into DIRACGrid:integration from AcquaDiGiorgio:BundleCE


@AcquaDiGiorgio

See #8475

Summary

This "system" comprises 3 main components: the CE, the Service and the DB. It also has an Agent, but that is not a critical piece.

It has been developed with AREX as the real CE, but in principle it should work with any other CE that implements getJobOutput.

The main idea is to receive the jobs through the BundleCE, which contacts the BundleService to store them in the BundleDB. When a bundle reaches a certain number of processors, the BundleService sends it to the real CE.

Job status retrieval is done through the service, which looks up the BundleID of the requested JobID and asks the real CE for the status of that bundle.
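
As a rough illustration of this lookup, here is a minimal sketch. The function name, the plain dictionaries standing in for the BundleDB, the `real_ce_status` callback and the status strings are all hypothetical; none of them are taken from the actual BundleService code.

```python
def get_job_status(job_id, job_to_bundle, bundle_to_task, real_ce_status):
    """Resolve a job's status through the bundle it belongs to.

    job_to_bundle:  JobID -> BundleID (stand-in for the JobToBundle table)
    bundle_to_task: BundleID -> TaskID on the real CE (stand-in for BundlesInfo)
    real_ce_status: callable returning the real CE's status for a TaskID
    All names and status strings here are assumptions made for illustration.
    """
    bundle_id = job_to_bundle.get(job_id)
    if bundle_id is None:
        return "Unknown"  # the job was never stored in a bundle
    task_id = bundle_to_task.get(bundle_id)
    if task_id is None:
        return "Waiting"  # the bundle exists but has not been submitted yet
    return real_ce_status(task_id)
```

In this sketch every job of a bundle reports the status of the bundle as a whole, since the real CE only knows about the bundle.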

The job output is obtained directly from the BundleCE: each job retrieves its own output without going through the BundleService.

The system

Bundle CE

The BundleCE is the main piece of the puzzle. This Computing Element is in charge of contacting the service to store the jobs in bundles. It is a virtual CE that serves as a "proxy" between the agent uploading the jobs and the real CE, passing through the Bundle Service.

It works the same way as any other CE and mimics the idea of the PoolCE, acting as an intermediary.

In theory, the Computing Element alone should be enough to operate this system, but due to bundle persistence issues we need the rest of the parts. Having only the CE would also complicate things, as it would require a single instance of the BundleCE class holding the information of every bundle in memory.

Bundle DB

The BundleDB is in charge of storing the individual jobs in multiple bundles following certain rules. This lets us have a stateless system that is robust to restarts and sudden shutdowns.

This database stores plenty of information, such as the location of each job and its outputs, the proxy's location and the CE information.

To select the bundle in which a job will be stored, it compares the real CE the job wants to be submitted to with the CE information of every bundle stored in the DB. If no bundle is available, it creates a new one containing this job.
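
The selection step can be sketched roughly as follows. This is only an in-memory illustration: the `Bundle` class, `select_bundle`, `add_job`, the default of 8 processors and the "has room" rule are assumptions, with field names loosely borrowed from the BundlesInfo table. The real BundleDB rules may differ, and bundles that were already submitted would have to be excluded as well.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Bundle:
    # Hypothetical in-memory stand-in for one BundlesInfo row.
    site: str
    ce: str
    queue: str
    max_processors: int
    processor_sum: int = 0
    bundle_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    job_ids: list = field(default_factory=list)


def select_bundle(bundles, site, ce, queue, processors, max_processors):
    """Return a bundle targeting the same real CE, creating one if none fits."""
    for bundle in bundles:
        same_target = (bundle.site, bundle.ce, bundle.queue) == (site, ce, queue)
        # Assumed rule: a bundle is available while the job still fits in it.
        has_room = bundle.processor_sum + processors <= bundle.max_processors
        if same_target and has_room:
            return bundle
    new_bundle = Bundle(site, ce, queue, max_processors)
    bundles.append(new_bundle)
    return new_bundle


def add_job(bundles, job_id, site, ce, queue, processors, max_processors=8):
    """Store the job in a matching bundle and return that bundle's ID."""
    bundle = select_bundle(bundles, site, ce, queue, processors, max_processors)
    bundle.job_ids.append(job_id)
    bundle.processor_sum += processors
    return bundle.bundle_id
```

With this sketch, two 4-processor jobs for the same CE and queue land in the same bundle, while a third job for that CE opens a new one because the first bundle is full.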

The ID of the Bundle serves as the PilotStamp, as multiple jobs reach the same bundle.

---
title: BundleDB
---
erDiagram
    direction LR

    BundlesInfo {
        string BundleID PK
        int ProcessorSum
        int MaxProcessor
        string Site
        string CE
        string Queue
        text CEDict
        string TaskID
        enum Status
        set Flags
        datetime FirstTimestamp
        datetime LastTimestamp
    }

    JobToBundle {
        string JobID PK
        string BundleID FK
        int DiracID
        string ExecutablePath
        string Outputs
        int Processors
    }
    
    JobInputs {
        string InputID PK
        string JobID FK
        string InputPath
    }

    BundlesInfo ||--o{ JobToBundle : BundleID
    JobToBundle ||--o{ JobInputs : JobID


Bundle Service (Bundler)

The Service serves as the bridge between all of the components. The main tasks it manages are:

  • Receiving the jobs from the BundleCE
  • Creating bundles in the Database and adding jobs to them
  • Sending the bundle to the real CE when it is ready for submission or when the submission is forced
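
The last task, deciding when a bundle goes to the real CE, might look like the sketch below. The dictionary keys mirror the BundlesInfo columns, but `maybe_submit`, the `submit_to_real_ce` callback and the status values "Storing" and "Sent" are hypothetical names chosen for illustration only.

```python
def maybe_submit(bundle, submit_to_real_ce, forced=False):
    """Submit the bundle if it is full or if the submission is forced.

    bundle: dict with BundleID, ProcessorSum, MaxProcessor, Status, TaskID
    submit_to_real_ce: callable taking a BundleID and returning a TaskID
    Returns True when a submission actually happened.
    """
    if bundle["Status"] != "Storing":
        return False  # already handed over to the real CE
    full = bundle["ProcessorSum"] >= bundle["MaxProcessor"]
    if not (full or forced):
        return False  # keep waiting for more jobs
    bundle["TaskID"] = submit_to_real_ce(bundle["BundleID"])
    bundle["Status"] = "Sent"
    return True
```

Checking the status first keeps the function idempotent, so a forced submission arriving just after a regular one does not send the same bundle twice.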

Bundle Agent (BundleManager)

The agent serves as a supplement for the system. In principle, it is not mandatory to have it, but it helps for 2 specific cases (at the time of writing).

First, stalled bundles. The bundle might be able to store up to X jobs before submission, but sometimes this number might take too much time to reach due to a low influx of jobs. This agent checks the last time a job was submitted to each bundle and forces a submission if it is taking too long.
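
A minimal sketch of this stalled-bundle check, assuming each bundle record exposes the LastTimestamp column from BundlesInfo. The function name, the 30-minute default and the "Storing" status value are hypothetical.

```python
from datetime import datetime, timedelta


def find_stalled(bundles, now, max_idle=timedelta(minutes=30)):
    """Return the IDs of unsent bundles that received no job for too long.

    bundles: iterable of dicts with BundleID, Status and LastTimestamp
    now:     current time, passed in so the check is easy to test
    """
    return [
        b["BundleID"]
        for b in bundles
        if b["Status"] == "Storing" and now - b["LastTimestamp"] > max_idle
    ]
```

The agent would then force a submission for each returned BundleID on every cycle.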

Second, checking the bundle heartbeat. Once the bundle is sent, the best way of checking whether it is still alive is to query its status and report it to the JobDB. This could be done through the CE or the service, but as the agent only gets executed once every x seconds, checking it once through the agent is much less CPU intensive than through the other options.

Known limitations

  1. The bundle uses the proxy of the first job stored in it.

Not a priority, as it should be the proxy of the pilot every time (as far as I'm aware).

  2. The service and the agent submitting the jobs must be on the same machine. This is because the database stores the proxy PATH on the machine that matched the job. So, for example, if the PushJobAgent is on machine "A" and it sends a job, the proxy is stored in the /tmp directory of machine "A"; for this system to work, we then need to set up the Bundle Service on machine "A", as the service is the one submitting the bundle.

Since storing the proxy itself in the DB is not an option, storing the DN and group of the proxy and then matching it through the ProxyManagerClient might be the best way to go.

Another possibility could be to use getRemoteCredentials from the service. I need to look into this.

  3. The outputs are first stored at /tmp/bundles (modifiable by the administrator) and then moved to the working directory of each job. This process is painfully slow and could bring a machine down if it has a tiny partition for /tmp, which is quite common.

By changing the behaviour of getJobOutput, we might be able to let each job retrieve its output directly.

  4. When the bundle finishes, only one of the jobs downloads the outputs; the rest wait until it finishes. The file movement is done separately by each job.

If we can change the behaviour of getJobOutput we can solve this one too.
If not, this could be a very difficult limitation to overcome.

  5. A JobAgent might not be able to see finalised bundles, as it checks the dictionary self.taskResults of the CE instead of calling getJobOutput.

This just requires some testing, as the current idea is viable. However, I think this solution should only exist alongside the PushJobAgent, as pilots are already able to manage parallel job execution properly.

  6. The bundle keeps running until every job has finished or failed. In a bundle with 2 jobs, if job A fails but job B does not and takes 2 hours to finish, job A will only learn that it failed when B finishes.

At the moment, it is accepted as a forced limitation, but it should be addressed.

TODOs

  • Tests for each component
  • Code documentation
  • Improve logging
  • Add support for tokens to the BundleCE
  • Rename Service
  • Accept Job Killing
  • Remove unused parts at the BundleAgent

BEGINRELEASENOTES

*Resources
NEW: BundleCE for bundled job submission.
NEW: AREXEnhancedCE for recursive job output retrieval.
FIX: Bug at AREXCE with executable name while constructing wrapperContent.

*WorkloadManagementSystem
NEW: Bundled Job submission using the new BundleDB, BundleHandler, BundleClient and BundleAgent components.

ENDRELEASENOTES

Untested code. Plus the BundleCE should change to contact the service
Improved code legibility
Not finished still
More cases must be tested, such as killing and rescheduling bundled jobs

Also added Alexandre's AREXEnhanced CE
Outputs are obtained once and the rest grab them locally
Add a debugging monitoring info (temporary)
This approach is mainly for debugging purposes
…jobs to fill the bundle

Also added some functionalities for later use with an agent. These have not been tested
@AcquaDiGiorgio AcquaDiGiorgio self-assigned this Mar 2, 2026
@aldbr aldbr linked an issue Mar 2, 2026 that may be closed by this pull request
3 tasks
Development

Successfully merging this pull request may close these issues.

[Feature]: Job grouping for HPCs with no external connectivity
