feat: BundleCE for job grouping by AcquaDiGiorgio · Pull Request #8476 · DIRACGrid/DIRAC

AcquaDiGiorgio · 2026-03-02T15:55:50Z

Summary

This "system" comprises three main components: the CE, the Service and the DB. It also has an Agent, but that is not a critical piece.

It has been developed with AREX as the real CE, but in principle it should work with any other CE that implements getJobOutput.

The main idea is to receive the jobs through the BundleCE, which contacts the BundleService to store them in the BundleDB. When a bundle reaches a certain number of processors, the BundleService sends it to the real CE.

Job status retrieval is done through the service, which looks up the BundleID of the requested JobID and asks the real CE for the status of that bundle.

The job output is obtained directly through the BundleCE: each job retrieves its own output without going through the BundleService.

The system

Bundle CE

The BundleCE is the main piece of the puzzle. This Computing Element is in charge of contacting the service to store the jobs in bundles. It is a virtual CE that serves as a "proxy" between the agent uploading the jobs and the real CE, passing through the Bundle Service.

It works the same way as any other CE and mimics the idea of the PoolCE, acting as an intermediary.

In theory, the Computing Element alone should be enough to operate this system, but bundles must persist, so the other parts are needed. Relying on the CE alone would also complicate things, as it would require a single instance of the BundleCE class holding the information of every bundle in memory.

Bundle DB

The BundleDB is in charge of storing the individual jobs in multiple bundles following certain rules. This lets the rest of the system stay stateless and robust against restarts and sudden shutdowns.

This database stores information such as the location of each job and its outputs, the proxy's location and the CE information.

To select the bundle for a job, it compares the real CE the job wants to be submitted to with the CE information of every bundle stored in the DB. If no bundle is available, it creates a new one containing this job.

The ID of the Bundle serves as the PilotStamp, as multiple jobs reach the same bundle.
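
The selection rule can be illustrated with a minimal in-memory sketch. It is not the BundleDB implementation: the `Bundle` class, the `select_bundle` function and the "enough free processors" check are assumptions made for illustration, since the description only states that the target CE is matched and that a new bundle is created when none is available.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Bundle:
    """In-memory stand-in for one BundlesInfo row (hypothetical helper)."""

    site: str
    ce: str
    queue: str
    max_processors: int
    processor_sum: int = 0
    bundle_id: str = field(default_factory=lambda: uuid4().hex)
    job_ids: list = field(default_factory=list)


def select_bundle(bundles, job_id, processors, site, ce, queue, max_processors):
    """Add the job to a matching bundle, creating a new bundle if none fits."""
    for bundle in bundles:
        same_target = (bundle.site, bundle.ce, bundle.queue) == (site, ce, queue)
        # Assumed rule: a bundle is available while it still has free processors.
        has_room = bundle.processor_sum + processors <= bundle.max_processors
        if same_target and has_room:
            break
    else:
        # No bundle targets this CE with enough room: start a new one.
        bundle = Bundle(site, ce, queue, max_processors)
        bundles.append(bundle)
    bundle.job_ids.append(job_id)
    bundle.processor_sum += processors
    return bundle
```

Every job added to the same bundle would then share `bundle_id`, which is the value the description says is used as the PilotStamp.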

---
title: BundleDB
---
erDiagram
    direction LR

    BundlesInfo {
        string BundleID PK
        int ProcessorSum
        int MaxProcessor
        string Site
        string CE
        string Queue
        text CEDict
        string TaskID
        enum Status
        set Flags
        datetime FirstTimestamp
        datetime LastTimestamp
    }

    JobToBundle {
        string JobID PK
        string BundleID FK
        int DiracID
        string ExecutablePath
        string Outputs
        int Processors
    }
    
    JobInputs {
        string InputID PK
        string JobID FK
        string InputPath
    }

    BundlesInfo ||--o{ JobToBundle : BundleID
    JobToBundle ||--o{ JobInputs : JobID

Bundle Service (Bundler)

The Service serves as the bridge between all of the components. The main tasks it manages are:

Receiving the jobs from the BundleCE
Creating bundles in the Database and adding jobs to them
Sending the bundle to the real CE when it is ready for submission or when a submission is forced
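
As a rough illustration of these tasks, the toy class below collects jobs and hands them to a CE once enough processors have been requested, or when a submission is forced. `FakeCE`, `Bundler`, `store_job` and `submit` are hypothetical names used only for this sketch; they are not the handler's real interface, and no database is involved here.

```python
class FakeCE:
    """Stand-in for the real CE; it only records what would be submitted."""

    def __init__(self):
        self.submitted = []

    def submit_bundle(self, job_ids):
        self.submitted.append(list(job_ids))


class Bundler:
    """Toy service: keeps one open bundle and submits it when it is full."""

    def __init__(self, ce, max_processors):
        self.ce = ce
        self.max_processors = max_processors
        self.pending = []  # (job_id, processors) pairs of the open bundle

    def store_job(self, job_id, processors):
        """Receive a job and submit the bundle once the threshold is reached."""
        self.pending.append((job_id, processors))
        if sum(procs for _, procs in self.pending) >= self.max_processors:
            self.submit()

    def submit(self):
        """Send the open bundle to the CE; also usable as a forced submission."""
        if self.pending:
            self.ce.submit_bundle([job_id for job_id, _ in self.pending])
            self.pending = []
```

Calling `submit()` directly corresponds to the forced case, for example when a bundle has been waiting too long for more jobs.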

Bundle Agent (BundleManager)

The agent supplements the system. In principle it is not mandatory, but it helps in two specific cases (at the time of writing).

First, stalled bundles. A bundle might be able to store up to X jobs before submission, but that number can take too long to reach when the influx of jobs is low. The agent checks the last time a job was added to each bundle and forces a submission if it is taking too long.

Second, checking the bundle heartbeat. Once a bundle is sent, the best way of checking whether it is still alive is to check its status and report it to the JobDB. This could be done through the CE or the service, but as the agent only runs once every x seconds, checking once through the agent is much less CPU-intensive than through the other options.
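
The stalled-bundle check can be sketched as a comparison between the last insertion time of each bundle and a maximum waiting time. The function name, the dictionary input and the `max_wait` parameter are assumptions for illustration; the agent in the PR reads this information from the BundleDB.

```python
from datetime import datetime, timedelta


def find_stalled(last_timestamps, now, max_wait):
    """Return the IDs of bundles whose last job arrived more than max_wait ago."""
    return [
        bundle_id
        for bundle_id, last_seen in last_timestamps.items()
        if now - last_seen > max_wait
    ]


# Invented example: b1 has been waiting for 30 minutes, b2 for 2 minutes.
now = datetime(2026, 3, 2, 12, 0)
last_timestamps = {
    "b1": now - timedelta(minutes=30),
    "b2": now - timedelta(minutes=2),
}
stalled = find_stalled(last_timestamps, now, timedelta(minutes=10))
```

Each bundle returned by such a check would then be force-submitted even though it has not reached its processor limit.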

Known limitations

The bundle uses the proxy of the first job stored in it.

Not a priority, as it should be the proxy of the pilot every time (as far as I'm aware).

The service and the agent submitting the jobs must be on the same machine. This is because the database stores the PATH of the proxy on the machine that matched the job. For example, if the PushJobAgent runs on machine "A" and sends a job, the proxy is stored in the /tmp directory of machine "A"; for this system to work, the Bundle Service must also be set up on machine "A", as the service is the one submitting the bundle.

As storing the proxy in the DB is out of the question, storing the DN and group of the proxy and then matching it through the ProxyManagerClient might be the best way to go.

Another possibility could be to use getRemoteCredentials from the service. I need to look into this.

The outputs are first stored at /tmp/bundles (modifiable by the administrator) and then moved to the working directory of each job. This process is painfully slow and could bring down a machine that has a small /tmp partition, which is quite common.

By changing the behaviour of getJobOutput, we might be able to let each job retrieve its output directly.

When the bundle finishes, only one of the jobs downloads the outputs; the rest wait until it finishes. The file movement is then done separately by each job.

If we can change the behaviour of getJobOutput we can solve this one too.
If not, this could be a very difficult limitation to overcome.
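
The "one job downloads, the rest wait" behaviour can be pictured with the following sketch, which uses threads and a lock. This is only an analogy under stated assumptions: `OutputFetcher` and its `download` callback are hypothetical, and the PR may coordinate the jobs differently (for example through files or the database).

```python
import threading


class OutputFetcher:
    """Ensure the outputs of a bundle are downloaded once, whoever asks first."""

    def __init__(self, download):
        self._download = download  # callable taking a bundle ID (hypothetical)
        self._lock = threading.Lock()
        self._done = set()

    def fetch(self, bundle_id):
        # The first caller downloads while holding the lock; later callers
        # block on the lock and then find the outputs already available.
        with self._lock:
            if bundle_id not in self._done:
                self._download(bundle_id)
                self._done.add(bundle_id)


# Four jobs of the same bundle ask for the outputs concurrently.
calls = []
fetcher = OutputFetcher(calls.append)
threads = [threading.Thread(target=fetcher.fetch, args=("b1",)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The sketch also shows the cost described above: every waiting job is blocked for the full duration of the single download.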

A JobAgent might not be able to see finalised bundles, as it checks the dictionary self.taskResults of the CE instead of calling getJobOutput.

This just requires some testing, as the current idea is viable. However, I think this solution should only exist alongside the PushJobAgent, as pilots are already able to manage parallel job execution properly.

Until every job has finished or failed, the bundle keeps running. In a bundle with 2 jobs, if job A fails but job B does not and takes 2 hours to finish, job A will only learn that it failed when B finishes.

At the moment, it is accepted as a forced limitation, but it should be addressed.
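
A small sketch of this limitation, assuming simplified state names: the state strings and both functions below are hypothetical and only model the behaviour described, not the code in the PR.

```python
# Hypothetical final states used for this illustration only.
FINAL_STATES = {"Done", "Failed"}


def bundle_finished(job_states):
    """A bundle counts as finished only when every job reached a final state."""
    return all(state in FINAL_STATES for state in job_states.values())


def visible_states(job_states):
    """What each job can see: nothing final until the whole bundle finishes."""
    if bundle_finished(job_states):
        return dict(job_states)
    return {job_id: "Running" for job_id in job_states}


# Job A has already failed, but job B is still running.
states = {"A": "Failed", "B": "Running"}
before = visible_states(states)

# Two hours later, job B finishes and both results become visible.
states["B"] = "Done"
after = visible_states(states)
```

Reporting per-job final states as soon as they are known would remove the delay for job A.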

TODOs

BEGINRELEASENOTES

*Resources
NEW: BundleCE for bundled job submission.
NEW: AREXEnhancedCE for recursive job output retrieval.
FIX: Bug at AREXCE with executable name while constructing wrapperContent.

*WorkloadManagementSystem
NEW: Bundled Job submission using the new BundleDB, BundleHandler, BundleClient and BundleAgent components.

ENDRELEASENOTES

AcquaDiGiorgio added 30 commits June 25, 2025 14:44

Fist idea for the bundleCE implementation

1c626b6

Fist implementation BundlerService and BundleBD

5763a9a

Untested code. Plus the BundleCE should change to contact the service

fixed sql syntax error

7ff94e3

add BundleDB integration tests

5703f81

fix errors obtained during integration tests

bddbde8

Changed TEXT datatype to VARCHAR

3dbd113

Added better templating and logging

b303723

Improved code legibility

Input files returned as list

6568659

BundlerService inserted in the ConfigTemplate

74f9159

Adapted BundleCE to the Service (untested)

34b380d

pre-commit

b928d65

BundleDB status changes

29f5585

BundlerTemplates refactor

72bb46e

Bundle - Status and task related changes

4240f6d

General changes

8f71edd

First working implementation

1605105

Not finished still

First complete version

eb12ec2

More cases must to be tested such as killing and rescheduling bundled jobs Added also the Alexandre's AREXEnhanced CE

Added a proper individual job status notification and output retrival

03b5373

Outputs are obtained once and the rest grab them locally

Setup bundled CE proxy

e227380

Add new table for long input treatment

63c2ec2

Change input insertion to DB and Bundled Job status retrieval

9453553

Remove unnecessary background process

699b66f

Add a debugging monitoring info (temporary)

Preprocess job wrapper offline at node

0fa0012

Improved output retrieval

27d7098

Added extra runner file for better control and monitroing

6d9dc67

This approach is mainly for debugging purposes

Added a timestamp to avoid bundle stalling when there are not enough …

47f950e

…jobs to fill the bundle Also added some functionalities for later use with an agent. These have not been tested

Remove unnecesary status files

f1be218

Fix job insertion in running or finished bundle

c2c6a7a

Remove testing code

502f3d7

UNTESTED: Added agent to monitor bundles

a96a870

AcquaDiGiorgio added 16 commits October 17, 2025 12:21

Modified ce.submitJob to be the same as the submission through the Jo…

2a590cb

…bAgent

Extended BundleManagerAgent ConfigTemplate

2a896fa

Changed the template to its original format

3861308

Added flags to control Bundle stages and accept the JobID obtained th…

c22ff2c

…ough the matcher

Updated agent to be able to force-submit stalled bundles

6706bde

Accommodated Service and CE to the schema of the DB

Pre-commit

874c81a

Generalize Bundle Status using PilotStatus

0f3025e

Send heartbeat to maintain bundles alive against StalledJobAgent

4317fba

fix(PushJobAgent): Bug while obtaining job output in failed job wrappers

dd11768

chore: Clean and document

232fb89

chore: Remove unnecesary _cleanFinishedBundles at BundleManagerAgent

5b04cc1

chore: Remove unnecesary procyPath at BundleDB

e0e31d6

pre-commit

97c2b24

chore: Remove ExecTemplate from BundleCE and BundleDB

48ec52c

Merge branch 'DIRACGrid:integration' into BundleCE

720baa6

chore: Remove debugging code

550a3fb

AcquaDiGiorgio self-assigned this Mar 2, 2026

aldbr linked an issue Mar 2, 2026 that may be closed by this pull request

[Feature]: Job grouping for HPCs with no external connectivity #8475

Open

3 tasks

AcquaDiGiorgio wants to merge 46 commits into DIRACGrid:integration from AcquaDiGiorgio:BundleCE
