
[FEATURE] Support snapshot/restore #46

@msfroh

Is your feature request related to a problem?

Currently, both the snapshot and restore APIs are implemented at the cluster manager, which means that they can't be performed on clusterless nodes. For production use cases, we need to be able to create snapshots and restore from them.

What solution would you like?

For a clusterless setup, snapshot and restore should not require coordination across nodes (since internode dependencies are exactly what we're trying to avoid). That means we need to be able to create shard-level (or at least node-level) snapshots.

Scheduling and executing snapshots should be each node's responsibility. This could use job-scheduler, if we can run it on data nodes; otherwise, some basic cron-like functionality shouldn't be too hard to implement. Scheduling a snapshot just means registering it as an in-progress snapshot in the node's own cluster state; from there, the shard-level work will already be handled by SnapshotShardsService. We will need to change OpenSearch core so that each data node finalizes its own snapshot and writes the snapshot metadata.
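To make the per-node workflow above concrete, here is a minimal sketch of a node-local, fixed-interval scheduler. It is not OpenSearch code: `NodeSnapshotScheduler`, `LocalSnapshotState`, and the `snapshot_shard` callback are hypothetical names, and the callback only stands in for the shard-level work that SnapshotShardsService would do. The clock is passed in explicitly so the behavior is easy to reason about.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class LocalSnapshotState:
    """Hypothetical stand-in for the node's own record of snapshots."""
    in_progress: Dict[str, List[str]] = field(default_factory=dict)
    finalized: List[str] = field(default_factory=list)


class NodeSnapshotScheduler:
    """Fixed-interval, node-local snapshot scheduler (illustrative only)."""

    def __init__(
        self,
        node_id: str,
        interval_seconds: float,
        local_shards: List[str],
        state: LocalSnapshotState,
        snapshot_shard: Callable[[str, str], None],
    ) -> None:
        self.node_id = node_id
        self.interval_seconds = interval_seconds
        self.local_shards = local_shards
        self.state = state
        self.snapshot_shard = snapshot_shard
        self._next_run: Optional[float] = None  # None means "run on the first tick"

    def tick(self, now: float) -> Optional[str]:
        """Start a snapshot if one is due; return its name, or None if not due."""
        if self._next_run is not None and now < self._next_run:
            return None
        self._next_run = now + self.interval_seconds

        # Assumed naming scheme: node id plus the start time in whole seconds.
        name = f"{self.node_id}-{int(now)}"

        # Step 1: register the snapshot as in progress in node-local state.
        self.state.in_progress[name] = list(self.local_shards)

        # Step 2: snapshot every shard this node owns; no other node is involved.
        for shard in self.local_shards:
            self.snapshot_shard(name, shard)

        # Step 3: finalize on this node (where snapshot metadata would be written).
        del self.state.in_progress[name]
        self.state.finalized.append(name)
        return name
```

A node would call `tick(time.time())` from whatever timer it already has. A real implementation would also need to handle failures between steps 1 and 3, which this sketch ignores.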

On the restore side of things, ideally we should not need to schedule a "restore job" for the "normal" case. Instead, when we start a new shard, recovering from a recent snapshot (if available) should be the default behavior. That is, it's not a special "restore" operation, it's just shard recovery.

Note that the existing structure for snapshots makes this difficult, since it's organized as snapshot -> index -> shard, so finding the most recent snapshot of a given shard means starting from the list of snapshots. For this kind of use case, index -> shard -> snapshot makes a lot more sense. We should consider reorganizing the snapshot repository layout if needed (since this makes sense for any cloud-native effort).
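The layout argument above can be illustrated with a small sketch. The key formats and function names below are hypothetical and do not reflect the actual blob layout of an OpenSearch repository; they only show why an index -> shard -> snapshot ordering turns "find the latest snapshot for this shard" into a single prefix lookup. The sketch also assumes snapshot names sort chronologically (for example, ISO-style timestamps).

```python
from typing import Iterable, Optional


def snapshot_first_path(snapshot: str, index: str, shard: int) -> str:
    """Key under the existing logical ordering: snapshot -> index -> shard."""
    return f"{snapshot}/{index}/{shard}"


def shard_first_path(index: str, shard: int, snapshot: str) -> str:
    """Key under the proposed ordering: index -> shard -> snapshot."""
    return f"{index}/{shard}/{snapshot}"


def latest_snapshot_for_shard(
    keys: Iterable[str], index: str, shard: int
) -> Optional[str]:
    """Return the newest snapshot name for one shard, given shard-first keys.

    Assumes snapshot names sort lexicographically in chronological order.
    """
    prefix = f"{index}/{shard}/"
    # Take the path segment right after the prefix; deeper segments are files.
    names = {k[len(prefix):].split("/", 1)[0] for k in keys if k.startswith(prefix)}
    return max(names, default=None)


def recovery_source(keys: Iterable[str], index: str, shard: int) -> str:
    """Default recovery decision: use the latest snapshot if one exists."""
    latest = latest_snapshot_for_shard(keys, index, shard)
    return f"snapshot:{latest}" if latest is not None else "empty"
```

With the shard-first ordering, a starting shard only needs to list keys under its own `index/shard/` prefix. With the snapshot-first ordering, it would have to enumerate snapshots and check each one for its shard.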

What alternatives have you considered?

We need some kind of disaster recovery for clusterless setups. Implementing a new snapshot/restore mechanism from scratch probably doesn't make sense. The existing mechanism almost works; it just needs to be tweaked to let each node own the whole snapshot workflow.

Do you have any additional context?

N/A

Labels: enhancement
