Feature/external deployment replication#109
nouseforaname wants to merge 18 commits into cloudfoundry:main
Conversation
This enables the required configuration to use a deployment of this release as a replica source or target. To enable GTID-based replication, the server_ids across all nodes need to be unique. To achieve this, the templates render server_id using the BOSH-provided `spec.index`, which should be unique across all instances. If the rendered instance is a replica source, the id starts counting from the lowest possible value (which is `1`, because setting server_id to `0` deactivates replication for the node). If rendering a replica target, the indexing starts at the highest possible value. This should ensure that the server_ids across the two sides of the replica setup do not overlap.
- handle the initial dump
- handle refresh dumps when we find that new tables were added (is there a better way? just the single tables?)
- log replication status and debug info if we find that something is wrong
- more explicit logging around the current state
- add post-start to determine if the bpm process successfully started up by parsing its log output

fly-by: fix double `--defaults-file` param in `source_dump` var.
Thanks! I will try to take a look at this and provide some feedback by Monday.
force-pushed from c85066e to 1f0db8a
fixes:
```
1) db_init template specifying users via various spec properties generates a valid db_init file
Failure/Error: expect(rendered_template).to eq File.read(File.join(dir, "db_init_all_features"))
expected: "SET @@session.sql_log_bin = off;\n\nDROP USER IF EXISTS 'root'@'localhost';\nDROP USER IF EXISTS 'ro... PROXY ON ''@'' TO 'special-admin-user'@'%' WITH GRANT OPTION;\n\nSET @@session.sql_log_bin = on;\n"
got: "SET @@session.sql_log_bin = off;\n\nDROP USER IF EXISTS 'root'@'localhost';\nDROP USER IF EXISTS 'ro... PROXY ON ''@'' TO 'special-admin-user'@'%' WITH GRANT OPTION;\n\nSET @@session.sql_log_bin = on;\n"
(compared using ==)
Diff:
@@ -12,6 +12,7 @@
DROP USER IF EXISTS 'root'@'%';
DROP USER IF EXISTS 'roadmin'@'%';
+
CREATE USER IF NOT EXISTS 'mysql-backup'@'localhost';
ALTER USER 'mysql-backup'@'localhost' IDENTIFIED WITH caching_sha2_password BY 'secret-backup-pw';
GRANT RELOAD, LOCK TABLES, REPLICATION SLAVE, REPLICATION CLIENT, /*!80001 BACKUP_ADMIN,*/ PROCESS ON *.* to 'mysql-backup'@'localhost';
# ./spec/pxc-mysql/db_init_spec.rb:147:in `block (3 levels) in <top (required)>'
```
force-pushed from 1f0db8a to a83d48a
abg left a comment
I reviewed this and left a trail of comments.
Beyond the concerns already noted, I would like to see some answer for Galera <-> Galera replication - even if that's simply "the replica must only be a single node for now". I.e. I think it would be fine for a template in the pxc-replicator job to abort if the number of MySQL nodes != 1. Robustly handling cluster-to-cluster async replication could be interesting too, but would be much more work.
I would also encourage pxc-replicator to be written in a Go module that is more robust and easier to test and maintain than the current bash scripting.
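The single-node guard could be as small as this sketch (the function name and the way the node count is obtained are hypothetical; in a real pxc-replicator template the count could come from a bosh link):

```shell
# Hypothetical guard: refuse to run unless the replica side is a single node.
# In a real template the count could come from something like link('mysql').instances.size.
require_single_node() {
  local count="$1"
  if [[ "$count" -ne 1 ]]; then
    echo "pxc-replicator currently requires a single-node replica deployment (got $count nodes)" >&2
    return 1
  fi
}

require_single_node 1 && echo "single-node check passed"
```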
```
log "checking replication status"

if [[ ! is_replication_configured ]]; then
```
I think you mean to say:
```
if ! is_replication_configured; then
  ...
fi
```
Otherwise you're evaluating a string.
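To illustrate the difference, here is a standalone sketch; `is_replication_configured` is stubbed to always report "not configured":

```shell
# Stub standing in for the real check; exit status 1 means "not configured".
is_replication_configured() { return 1; }

# Wrong: inside [[ ]] the bare word is a non-empty string, which is always truthy,
# so the negation is always false and the function never even runs.
if [[ ! is_replication_configured ]]; then
  echo "string test says: not configured"
else
  echo "string test says: configured"
fi

# Right: negate the command's exit status directly.
if ! is_replication_configured; then
  echo "exit-status test says: not configured"
fi
```

The first branch prints "string test says: configured" even though the stub always fails, while the second correctly prints "exit-status test says: not configured".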
```
log "replication is not configured. enabling"
OUT=$(enable_replication);
if [[ $(enable_replication) != "" ]]; then
```
Do you intend to call enable_replication twice here?
```
log "resyncing with full backup"
log $($self_mysql <<< "STOP REPLICA;")
log $($self_mysql <<< "RESET REPLICA;")
log $($self_mysql <<< "RESET MASTER;")
```
Just a note: RESET MASTER is deprecated in v8.4 and removed in MySQL v9, replaced with RESET BINARY LOGS AND GTIDS.
I understand this is likely used here since pxc-release still supports MySQL v8.0 which still requires this syntax.
yeah, what's the preferred way of handling this?
I noticed while reading the docs that they're still cleaning up the weirdo naming.
I already tried using the replica/source naming as far as possible, but didn't know if it would be idempotent :/
https://dev.mysql.com/doc/refman/8.4/en/replication-howto-repuser.html
it seems that the privilege still does not have an alternative?
Yeah, v8.0 (and earlier) only support RESET MASTER and v8.4 only supports RESET BINARY LOGS ...
We've typically dealt with this kind of thing with version checks:
```
if mysql_major_minor() == "8.0":  # <- Delete branch when we don't care about old version anymore
    do_legacy_sql_stuff()
else:
    do_latest_sql_stuff()
end
```
That could potentially be done at template rendering time by looking at the p('mysql_version') property.
MySQL v8.0 is EOL (although we'll probably get one more patch release from Percona for v8.0.46), so there is an argument to only support v8.4 and onwards. Maybe guard against using this feature on v8.0?
```
# In some pxc-replicator template
<%- if p('mysql_version') == "8.0" -%>
raise "Unsupported <helpful error message>"
<%- end -%>
```
Hmm. How'd that play with cf-deployment?
It seems that 8.4 is still experimental cloudfoundry/cf-deployment@3870c60
I'm not sure if there are any reasons why it's still set up to default to 8.0, other than being cautious.
I'm going to ask around about the plans for runtime in regards to switching the default. For now I guess I'll try to find all relevant places and sprinkle some ifs.
```
    "private_key": "..."
  }
  default: {}
config.source.tls:
    "private_key": "..."
  }
  default: {}
config.tls.enabled:
```
Is there a reason not to use TLS? I would opt towards enforcing it and requiring certificates to be configured.
yeah. Mostly my inexperience around databases.
Do you want to use it just for encryption? or also auth? While I did find tls.client and the certs that can be created in pxc-mysql, I didn't know what to do with it because I have not found a reference for its usage in cf-deployment.
Having said that, I already started looking into it, and my understanding of tls.client is that it can be used to encrypt the connection, but the cert itself is not used to authenticate any database user?
Are you looking to achieve something like:
```
CREATE USER user1 IDENTIFIED BY 'Passw0rd!' REQUIRE
SUBJECT '/C=BE/ST=Hainaut/L=Maurage/O=MySQL/OU=Community/CN=user1/emailAddress=user1@oracle.com';
```
or do you want to get rid of Passw0rd altogether?
I would encourage the replication channel to be encrypted. I think requiring that in 2026 is a good security stance.
So CHANGE REPLICATION SOURCE ... SOURCE_SSL = 1, SOURCE_SSL_CA = '/path/to/client-ca.pem', SOURCE_SSL_VERIFY_SERVER_CERT = 1. This might be the tls.client.ca on pxc-mysql, configured appropriately (i.e. referencing /var/vcap/jobs/pxc-mysql/certificates/client-ca.pem).
Mutual TLS via x509 is not necessary for this PR, but maybe useful in the future.
Yeah, I already started adding commits for just the encryption layer yesterday. As far as I can tell, that should be enabled now.
```
log $($self_mysql <<< "STOP REPLICA;")
log $($self_mysql <<< "RESET REPLICA;")
log $($self_mysql <<< "RESET MASTER;")
log $($source_dump --all-databases --triggers --routines --single-transaction | $self_mysql)
```
Might need set -o pipefail or check $PIPESTATUS for failures here.
Additionally I will note: mysqldump is okay for small databases, but I might explore something like the mysql clone plugin to handle replicating non-trivial datasets.
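For example, without `pipefail` a failure of the dump is masked by the successful consumer at the end of the pipe. A standalone sketch (`fake_dump` is a stand-in for `$source_dump`):

```shell
set -o pipefail  # without this, the pipeline's status is that of the last command only

# Stand-in for $source_dump: emits some data, then fails.
fake_dump() { echo "partial data"; return 3; }

if fake_dump | cat > /dev/null; then
  echo "pipeline reported success"
else
  echo "pipeline failed with status $?"   # with pipefail: status 3
fi

# Alternatively, inspect each stage's status via PIPESTATUS:
fake_dump | cat > /dev/null
echo "per-stage statuses: ${PIPESTATUS[*]}"   # "3 0": dump failed, cat succeeded
```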
I unfortunately have no idea what constitutes small in the database world.
What is a realistic size limit for small?
To be clear - mysql_clone is not something I would expect for this PR - I was mentioning it as an alternative for the future. I think mysqldump is fine for an initial PR here.
In general, I would say database sizes >20-30GB probably warrant something other than mysqldump. mysql_clone (or other physical backup / restore approaches) are about an order of magnitude (or two) faster for large database sizes depending on the hardware limits.
```
while [[ $SECONDS -lt <%= p('monit_startup_timeout') %> ]]; do
  if grep "$(date +%Y-%m-%dT%H:%M).*replication healthy" /var/vcap/sys/log/pxc-replicator/pxc-replicator.stdout.log > /dev/null; then
    log "replication reported healthy within the last minute"
```
```
"-o operations/update-vm-type.yml"
"-o operations/update-persistent-disk-size.yml"
"-o operations/update-network.yml"
"-o operations/update-vm-type.yml"
```
The operations/update-vm-type.yml ops-file shows up twice.
```
server_id = p('engine_config.server_id')
# ensure we're not starting at index 0. replication is disabled when it is set to `0`
# it just needs to be unique
# https://dev.mysql.com/doc/refman/8.4/en/replication-options.html
if p('engine_config.enable_replication_source')
  server_id = spec.index + 1
end
# if we're the replication target, count down from highest value to avoid overlaps between source and target
if p('engine_config.enable_replication_target')
  server_id = 4294967296 - spec.index - 1 - 1 # max value is `2^32 - 1`. spec.index can be 0.
end
```
This is a bit messy and does not support multiple replica deployments given the static "base" server-id.
In general, Percona recommends all instances within a Galera deployment have the same server-id. So a primary cluster may have server_id: 1 and an (asynchronous) replica cluster may have server_id: 2
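Following that recommendation, the existing `engine_config.server_id` property would be enough; a hypothetical manifest sketch (property path taken from the diff above, values illustrative):

```yaml
# Source deployment: all Galera nodes share one id
properties:
  engine_config:
    server_id: 1
---
# Replica deployment: a different id shared by the whole replica cluster
properties:
  engine_config:
    server_id: 2
```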
Galera <-> Galera async replication does get complicated. You might read some of the Percona blog posts like:
Yes, and I got very confused. Thanks for pointing that out.
```
SECONDS=0

while [[ $SECONDS -lt <%= p('monit_startup_timeout') %> ]]; do
  if grep "$(date +%Y-%m-%dT%H:%M).*replication healthy" /var/vcap/sys/log/pxc-replicator/pxc-replicator.stdout.log > /dev/null; then
```
This seems a little brittle. I would probably actually query the replica and validate the IO_Thread and SQL_Thread are in a healthy state.
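A sketch of that check, with the `SHOW REPLICA STATUS` parsing split into a pure function so it can be shown with canned output (`$self_mysql` is the helper used elsewhere in the script; the sample text is illustrative):

```shell
# Healthy means both the IO and SQL threads report "Yes" in SHOW REPLICA STATUS\G output.
replication_healthy() {
  local status="$1"
  grep -q "Replica_IO_Running: Yes" <<< "$status" &&
    grep -q "Replica_SQL_Running: Yes" <<< "$status"
}

# In post-start this would be: status="$($self_mysql <<< 'SHOW REPLICA STATUS\G')"
# Canned sample output for illustration:
status_sample='
             Replica_IO_Running: Yes
            Replica_SQL_Running: Yes
          Seconds_Behind_Source: 0
'
if replication_healthy "$status_sample"; then
  echo "replication healthy"
else
  echo "replication broken"
fi
```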
Some old deployments may want to use the replication feature without having native access to it; one example would be cf-deployment. Operators could opt to manually onboard their existing cf deployment (with an old version of the pxc-release) as a replica, but that requires patching the permissions of the roadmin user. Providing these source_admin creds will patch the roadmin user to be usable for a replication setup. This also reworks the config to enable providing manual values instead of forcing link usage, and renames the link in the pxc job and the replicator job to be less confusing. This should help disambiguate manual linking in bosh manifests.
the link naming changed, but the ops files did not adjust for that.
it seems that was unnecessary, as pointed out in the review: the server_id needs to differ between source and target, not for each node within source or target.
First of all, sorry for the horrible PR (I'm going to excuse myself by having had to learn a lot about pxc and mysql), and second, thank you for the in-depth review. I marked this as draft for now. I set up CI to deploy the current state of the release and iterate on it until this is ready. As for "should this be Go code?": Yes :). I think I misunderstood you when I asked in slack about … There's one thing that I already can answer: my use case is definitely running an external single-instance replica to offload backups from a cf-deployment cluster. I'd like to avoid trying to create the … My suggestion would be this: I'd like to keep the … Again, thank you for the feedback.
this adds the property parsing and link property parsing to optionally enable encryption for the mysql connections
remove it
Yep, apologies for any confusing feedback. Some of my comments were "nice-to-have-for-the-future" and I didn't make that super obvious. I wouldn't expect a perfect implementation in this initial PR. Targeting just cf-deployment use cases, mysqldump is probably fine and a single replica is fine. I think requiring TLS for the MySQL replication channel (and mysqldump / provisioning channel) would be ideal - and at least TLS being a supported configuration. However, mutual TLS and other advanced use cases are not necessary.
Yep, I think that's fine for now. A Go implementation doesn't need to be part of this PR, but I think it's a good end state for testability and alignment with other tooling in the release.
these are required
the output was already cached in the line before.
this file missed adjustments after it got created. it was supposed to set the disk size.
the bpm yml was copied over from pxc-mysql and missed removing some of the mounted paths: the certificates folder is not required at all, and the mylogin.cnf contains a socket for the local connection, so we still need sys/run for now.
we just need read access:
- /var/vcap/sys/run/pxc-mysql for socket access
- /var/vcap/jobs/pxc-mysql/config for mylogin.cnf
- /var/vcap/jobs/pxc-replicator/certificates for remote / source tls certs
force-pushed from 0fd7129 to 02c0ae7
Hi there, I'd like to be able to do replica setups with the release. Here's my idea, let me know what you think of the approach.
Feature or Bug Description
It adds a job and a few config options to enable cross-deployment replication for the pxc release.
Motivation
I want continuous backups of my running pxc cluster