-
Notifications
You must be signed in to change notification settings - Fork 182
feat(import): add script tool for multiple hbase snapshot imports) #4606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
26e20e4
58cb66e
66659a8
a1f9b04
49c3e28
5ec8dc1
f1125d8
e85775c
d595262
4c5446e
3fbf8aa
b7866da
aedfbb0
cc755c8
59ae558
a862ef1
1c83a18
ca455c1
1671b8d
b68498f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| # HBase Snapshot Import Helper Script Usage | ||
|
|
||
| This document describes the environment variables used by the `run-snapshot-import.sh` script to automate HBase snapshot imports into Cloud Bigtable using Dataflow. | ||
|
|
||
| ## Environment Variables | ||
|
|
||
| The script relies on the following environment variables. You should set them before executing the script. | ||
|
|
||
| | Variable | Description | Example / Suggested Value | | ||
| | :--- | :--- | :--- | | ||
| | `PROJECT_ID` | The Google Cloud Project ID where the Bigtable instance and Dataflow jobs reside. | `your-project-id` | | ||
| | `INSTANCE_ID` | The Bigtable Instance ID to import data into. | `your-instance-id` | | ||
| | `BUCKET` | The GCS bucket name used for Dataflow staging, temp files, and default snapshot source path. | `your-gcs-bucket` | | ||
| | `REGION` | The GCP region to run the Dataflow jobs in. | `us-central1` | | ||
| | `TABLE_NAME` | The target Bigtable table name. | `your-table-name` | | ||
| | `SNAPSHOT_NAME` | The name of the HBase snapshot to import. | `your-snapshot-name` | | ||
| | `SNAPSHOT_SOURCE_DIR` | The GCS path where the HBase snapshot export is located. | `gs://your-gcs-bucket/snapshots` | | ||
| | `SERVICE_ACCOUNT` | The service account email to run the Dataflow jobs. | `your-service-account@developer.gserviceaccount.com` | | ||
| | `NUM_SHARDS` | The number of shards to split the import into for parallel processing. | `20` | | ||
| | `MAX_INFLIGHT_RPCS` | Maximum number of inflight RPCs for Bigtable client. | `100` | | ||
| | `BULK_MUTATION_CLOSE_TIMEOUT_MINUTES` | Timeout in minutes for closing bulk mutations. | `30` | | ||
| | `NETWORK` | VPC Network name for Dataflow workers. | `your-network` | | ||
| | `SUBNETWORK` | VPC Subnetwork name for Dataflow workers. | `regions/us-central1/subnetworks/your-subnetwork` | | ||
|
|
||
| ## Usage | ||
|
|
||
| ### Run a specific shard range | ||
| ```bash | ||
| ./run-snapshot-import.sh <start_shard> <end_shard> | ||
| ``` | ||
| Example: `./run-snapshot-import.sh 0 5` | ||
|
|
||
| ### Run all shards (Auto-parallel mode) | ||
| ```bash | ||
| ./run-snapshot-import.sh --all | ||
| ``` | ||
| This mode will first run the restore step, and then launch background processes for all shards in parallel groups of 4 by default. | ||
|
|
||
| ## Advanced Usage | ||
|
|
||
| ### Manual Parallel Execution | ||
|
|
||
| To run shards in parallel groups (e.g., assuming 20 shards total), you can run multiple instances of this script. | ||
|
|
||
| > [!IMPORTANT] | ||
| > Shard 0 performs the restore step. You MUST run the first group (including shard 0) first and let it complete the restore step before launching other groups in parallel. Otherwise, they will fail because the restored files won't exist yet! | ||
|
|
||
| Example for manual parallel execution: | ||
| ```bash | ||
| ./run-snapshot-import.sh 0 3 & # Run this first! | ||
| # Wait for shard 0 to finish restore, then run the rest: | ||
| ./run-snapshot-import.sh 4 7 & | ||
| ./run-snapshot-import.sh 8 11 & | ||
| ./run-snapshot-import.sh 12 15 & | ||
| ./run-snapshot-import.sh 16 19 & | ||
| ``` | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### JDK Compatibility | ||
|
|
||
| If you are running on a newer JDK (like Java 21 or 26) and hit ByteBuddy errors, you can add `-Dnet.bytebuddy.experimental=true` to the `java` command lines in the script. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,142 @@ | ||
| #!/bin/bash | ||
|
|
||
| # ============================================================================== | ||
| # HBase Snapshot Import Helper Script | ||
| # ============================================================================== | ||
| # This script runs a range of Dataflow snapshot import jobs sequentially or in parallel. | ||
| # Must be executed from the 'bigtable-dataflow-parent/bigtable-beam-import' directory. | ||
| # | ||
| # For detailed usage and advanced options, see: SNAPSHOT_IMPORT_USAGE.md | ||
| # ============================================================================== | ||
|
|
||
| # ------------------------------------------------------------------------------ | ||
| # Environment Variables | ||
| # ------------------------------------------------------------------------------ | ||
| # Most users will need to set these variables before running the script. | ||
| # See SNAPSHOT_IMPORT_USAGE.md for details and expected values. | ||
|
|
||
| # --- Required / Common Configurations --- | ||
| # export PROJECT_ID="your-project-id" | ||
| # export INSTANCE_ID="your-instance-id" | ||
| # export BUCKET="your-gcs-bucket" | ||
| # export REGION="us-central1" | ||
| # | ||
| # export TABLE_NAME="your-table-name" | ||
| # export SNAPSHOT_NAME="your-snapshot-name" | ||
| # export SNAPSHOT_SOURCE_DIR="gs://your-gcs-bucket/snapshots" | ||
| # export SERVICE_ACCOUNT="your-service-account" | ||
|
|
||
| # --- Sharding & Tuning --- | ||
| # export NUM_SHARDS="20" | ||
| # export MAX_INFLIGHT_RPCS="100" | ||
| # export BULK_MUTATION_CLOSE_TIMEOUT_MINUTES="30" | ||
|
|
||
| # --- Network Configurations --- | ||
| # export NETWORK="your-network" | ||
| # export SUBNETWORK="your-subnetwork" | ||
|
|
||
| # ------------------------------------------------------------------------------ | ||
| # Usage | ||
| # ------------------------------------------------------------------------------ | ||
| # Usage: ./run-snapshot-import.sh <start_shard> <end_shard> | ||
| # Or: ./run-snapshot-import.sh --all | ||
| # (Runs all shards in parallel groups of 4 by default) | ||
| # | ||
| # Examples: | ||
| # ./run-snapshot-import.sh 0 3 | ||
| # ./run-snapshot-import.sh --all | ||
|
|
||
| if [ "$#" -ne 2 ] && [ "$1" != "--all" ]; then | ||
| echo "Usage: $0 <start_shard> <end_shard>" | ||
| echo " Or: $0 --all" | ||
| exit 1 | ||
| fi | ||
|
|
||
| START_SHARD=$1 | ||
| END_SHARD=$2 | ||
|
|
||
| # Configurations | ||
| JAR_PATH="target/bigtable-beam-import-2.17.0-shaded.jar" | ||
|
|
||
| # --- AUTO-PARALLEL MODE --- | ||
| if [ "$1" == "--all" ]; then | ||
| echo "🚀 Starting fully automated snapshot import..." | ||
|
|
||
| # Step 1: Perform ONLY the restore step | ||
| echo "Step 1/2: Performing snapshot restore (blocking)..." | ||
| java -jar ${JAR_PATH} importsnapshot \ | ||
| --runner=DataflowRunner \ | ||
| --project=${PROJECT_ID} \ | ||
| --bigtableInstanceId=${INSTANCE_ID} \ | ||
| --bigtableTableId=${TABLE_NAME} \ | ||
| --hbaseSnapshotSourceDir=${SNAPSHOT_SOURCE_DIR} \ | ||
| --snapshots=${SNAPSHOT_NAME}:${TABLE_NAME} \ | ||
| --stagingLocation=gs://${BUCKET}/dataflow/staging \ | ||
| --tempLocation=gs://${BUCKET}/dataflow/temp \ | ||
| --region=${REGION} \ | ||
| --performOnlyRestoreStep=true \ | ||
| --jobName="restore-job" \ | ||
| --network=${NETWORK} \ | ||
| --subnetwork=${SUBNETWORK} | ||
|
|
||
| echo "Restore completed. Proceeding to data import." | ||
|
|
||
| # Step 2: Launch parallel groups of 4 | ||
| echo "Step 2/2: Launching parallel groups of 4 shards..." | ||
| SHARDS_PER_GROUP=4 | ||
|
|
||
| for (( start=0; start<$NUM_SHARDS; start+=$SHARDS_PER_GROUP )); do | ||
| end=$((start + SHARDS_PER_GROUP - 1)) | ||
| [ $end -ge $NUM_SHARDS ] && end=$((NUM_SHARDS - 1)) | ||
|
|
||
| echo "Launching group: shards $start to $end in background" | ||
| # Call ourselves with the range! | ||
| $0 $start $end & | ||
| done | ||
|
|
||
| echo "All groups launched. Waiting for all background jobs to finish..." | ||
| wait | ||
| echo "🎉 All import jobs completed!" | ||
| exit 0 | ||
| fi | ||
| # ---------------------------------------- | ||
|
|
||
| # Standard Range Mode | ||
| for i in $(seq $START_SHARD $END_SHARD); do | ||
| echo "Submitting Dataflow job for shardIndex: $i" | ||
|
|
||
| # We skip restore for all shards if running via --all because Step 1 handled it. | ||
| # If running manually via ranges, shard 0 will perform restore. | ||
| SKIP_RESTORE="true" | ||
| if [ $i -eq 0 ]; then | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How does this work? If you have --all and have alredy restored the snaphsot, won't this re-restore? |
||
| SKIP_RESTORE="false" | ||
| fi | ||
|
|
||
| JOB="job-${i}" | ||
| java -jar ${JAR_PATH} importsnapshot \ | ||
| --runner=DataflowRunner \ | ||
| --project=${PROJECT_ID} \ | ||
| --bigtableInstanceId=${INSTANCE_ID} \ | ||
| --bigtableTableId=${TABLE_NAME} \ | ||
| --hbaseSnapshotSourceDir=${SNAPSHOT_SOURCE_DIR} \ | ||
| --snapshots=${SNAPSHOT_NAME}:${TABLE_NAME} \ | ||
| --stagingLocation=gs://${BUCKET}/dataflow/staging \ | ||
| --tempLocation=gs://${BUCKET}/dataflow/temp \ | ||
| --workerMachineType=n1-highmem-4 \ | ||
| --diskSizeGb=500 \ | ||
| --maxNumWorkers=10 \ | ||
| --region=${REGION} \ | ||
| --serviceAccount=${SERVICE_ACCOUNT} \ | ||
| --usePublicIps=false \ | ||
| --enableSnappy=true \ | ||
| --skipRestoreStep=${SKIP_RESTORE} \ | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is good, but how are we passing the restorePath?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I feel like we should have a custom restore path and the script (idempoent by adding timestamp etc) and use it for restore the path and pass it as the restorepath in every job. Also, with this model, who cleans up the restore path? is there a way to trigger a cleanup at the end of the script? or we have a tool that can be used? We can also say its a manual step. but then this script should output something to the tune of "the snapshot was imported, please cleanup $RESTORE_PATH once validation succeeds." |
||
| --numShards=${NUM_SHARDS} \ | ||
| --shardIndex=$i \ | ||
| --jobName="${JOB}" \ | ||
| --network=${NETWORK} \ | ||
| --subnetwork=${SUBNETWORK} \ | ||
| --maxInflightRpcs=${MAX_INFLIGHT_RPCS} \ | ||
| --bulkMutationCloseTimeoutMinutes=${BULK_MUTATION_CLOSE_TIMEOUT_MINUTES} | ||
|
|
||
| # Sequential within this script instance | ||
| done | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this model, how does the shard1. know htat it has to wait for restore to finish from shard 0? won't it run and fail becuase restore path is empty?