- Introduction
- Documentation
- Dependencies
- File Validator
- Contributing
- Sage Bionetworks Only
- Testing
- Production
- Github Workflows
This repository documents code used to gather, QC, standardize, and analyze data uploaded by institutes participating in AACR's Project GENIE (Genomics, Evidence, Neoplasia, Information, Exchange).
For more information about the AACR genie repository, visit the GitHub Pages site.
This package contains R, Python, and CLI tools. You will need the following tools and packages to reproduce these results:
- Python >=3.10 and <3.12
pip install -r requirements.txt
- bedtools
- R 4.3.3
renv::install()
- Follow instructions here to install synapser
- Java = 21
- For mac users, it seems to work better to run
brew install java
- wget
- For mac users, have to run
brew install wget
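The requirements above can be checked before installing anything. The following is a minimal sketch and not part of this repository: the function names are hypothetical, and it assumes the command-line tools are discoverable on PATH under the executable names `bedtools`, `Rscript`, `java`, and `wget`. It only checks that the tools exist, not which versions they are.

```python
import shutil
import sys

# Executable names are assumptions; adjust them if your system differs.
REQUIRED_TOOLS = ("bedtools", "Rscript", "java", "wget")


def python_version_supported(version=None):
    """Return True when the version is >= 3.10 and < 3.12."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor < (3, 12)


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    if not python_version_supported():
        print("Unsupported Python version:", sys.version.split()[0])
    for tool in missing_tools():
        print("Not found on PATH:", tool)
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.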
One of the features of the aacrgenie package is that it provides a local validation tool that GENIE data contributors can install and use to validate their files locally prior to uploading to Synapse.
These instructions will install all the necessary components for you to run the validator locally on all of your files, including the Synapse client.
- Create a virtual environment using the package manager of your choice (e.g. conda, pipenv, pip)
Example of creating a simple python environment
python3 -m venv <env_name>
source <env_name>/bin/activate
- Install the genie package
pip install aacrgenie
- Verify the installation
genie -v
- Set up authentication with Synapse through the local .synapseConfig or using an environment variable
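Before running the validator, it can help to confirm which credential source is in place. The sketch below is an illustration rather than part of the genie package: the function name is hypothetical, and it assumes the token lives either in the `SYNAPSE_AUTH_TOKEN` environment variable or under `authtoken` in the `[authentication]` section of `~/.synapseConfig`. The order in which it looks is arbitrary, and it neither contacts Synapse nor checks that the token is valid.

```python
import configparser
import os
from pathlib import Path


def find_synapse_credentials(environ=None, config_path=None):
    """Return "environment", "config", or None. Does not validate the token."""
    environ = os.environ if environ is None else environ
    if environ.get("SYNAPSE_AUTH_TOKEN"):
        return "environment"

    path = Path(config_path) if config_path else Path.home() / ".synapseConfig"
    # interpolation=None keeps "%" characters in a token from being
    # treated as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is skipped without raising
    if parser.get("authentication", "authtoken", fallback=""):
        return "config"
    return None


if __name__ == "__main__":
    print("Credential source:", find_synapse_credentials())
```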
Get help for all available commands
genie validate -h
Running validator on clinical file
genie validate data_clinical_supp_SAGE.txt SAGE
Running validator on cna file. Note that the flag --nosymbol-check is REQUIRED when running the validator for cna files because you would need access to an internal bed database table without it. For DEVELOPERS this is not required.
genie validate data_cna_SAGE.txt SAGE --nosymbol-check
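When several files need validating, the two commands above can be wrapped in a small script. This is a sketch under stated assumptions, not a feature of the genie package: the helper names are hypothetical, and cna files are recognised only by a `data_cna` file name prefix, as in `data_cna_SAGE.txt`. Exit codes are collected as-is without interpreting them.

```python
import subprocess
from pathlib import Path


def build_validate_command(file_name, center, developer=False):
    """Build the argument list for one `genie validate` call."""
    command = ["genie", "validate", str(file_name), center]
    # Assumption: cna files are identified by the data_cna prefix.
    # Non-developers must pass --nosymbol-check for these files.
    is_cna = Path(file_name).name.lower().startswith("data_cna")
    if is_cna and not developer:
        command.append("--nosymbol-check")
    return command


def validate_files(file_names, center, run=subprocess.run):
    """Run the validator on each file and map file name to exit code."""
    return {
        name: run(build_validate_command(name, center)).returncode
        for name in file_names
    }


if __name__ == "__main__":
    files = ["data_clinical_supp_SAGE.txt", "data_cna_SAGE.txt"]
    for name in files:
        print(" ".join(build_validate_command(name, "SAGE")))
```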
Please see the contributing guide to learn how to contribute to the GENIE package.
These are instructions on how to set up your environment and run the pipeline locally.
- Make sure you have read through the GENIE Onboarding Docs and have access to all of the required repositories, resources and synapse projects for Main GENIE.
- Be sure you are invited to the Synapse GENIE Admin team.
- Make sure you are a Synapse certified user: Certified User - Synapse User Account Types
- Be sure to clone the cbioportal repo: https://github.com/cBioPortal/cbioportal and git checkout the version of the repo pinned to the Dockerfile.
- Be sure to clone the annotation-tools repo: https://github.com/Sage-Bionetworks/annotation-tools and git checkout the version of the repo pinned to the Dockerfile.
Follow instructions to install conda on your computer:
Install conda-forge and mamba
conda install -n base -c conda-forge mamba
Install Python and R versions via mamba
mamba create -n genie_dev -c conda-forge python=3.10 r-base=4.3
Installing via pipenv
- Specify a python version that is supported by this repo:
pipenv --python <python_version>
- Activate your pipenv:
pipenv shell
Using docker is the most reproducible method, even though it is the most tedious to develop with. See the CONTRIBUTING docs for how to develop locally with docker. The steps below set up the docker image in your environment.
- Pull a pre-existing docker image or build from the Dockerfile:
  - Pull a pre-existing docker image. You can find the list of images from here.
  docker pull <some_docker_image_name>
  - Build from Dockerfile
  docker build -f Dockerfile -t <some_docker_image_name> .
- Run docker image:
docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <some_docker_image_name>
- Clone this repo and install the package locally.
  - Install Python packages. This is the more traditional way of installing dependencies. Follow instructions here to learn how to install pip.
  pip install -e .
  pip install -r requirements.txt
  pip install -r requirements-dev.txt
  - Install R packages. Note that the R package setup is the most unpredictable part, so you will likely have to install specific packages manually before the rest will install.
  Rscript R/install_packages.R
- Configure the Synapse client to authenticate to Synapse.
  - Create a Synapse Personal Access token (PAT).
  - Add a ~/.synapseConfig file
  [authentication]
  authtoken = <PAT here>
  - OR set an environment variable
  export SYNAPSE_AUTH_TOKEN=<PAT here>
  - Confirm you can log in from your terminal.
  synapse login
- Run the different steps of the pipeline on the test project. The --project_id syn7208886 points to the test project. You should always be using the test project when developing, testing and running locally.
  - Validate all the files excluding vcf files:
  python3 bin/input_to_database.py main --project_id syn7208886 --onlyValidate
  - Validate all the files:
  python3 bin/input_to_database.py mutation --project_id syn7208886 --onlyValidate --genie_annotation_pkg ../annotation-tools
  - Process all the files aside from the mutation (maf, vcf) files. The mutation processing was split because it takes at least 2 days to process all the production mutation data. Ideally, there would be a parameter to exclude or include file types to process/validate, but that is not implemented.
  python3 bin/input_to_database.py main --project_id syn7208886 --deleteOld
  - Process the mutation data. This command uses the annotation-tools repo that you cloned previously, which houses the code that standardizes/merges the mutation (both maf and vcf) files and re-annotates the mutation data with Genome Nexus. The --createNewMafDatabase flag will create new mutation tables in the test project. This flag is necessary for production data for two main reasons:
    - During processing of mutation data, the data is appended to the existing table, so without creating an empty table, duplicated data would be uploaded.
    - By design, Synapse Tables were meant to be appended to. When a Synapse Table is updated, it takes time to index the table and return results. This can cause problems for the pipeline when trying to query the mutation table. It is actually faster to create an entirely new table than to update or delete all rows and append new rows when dealing with millions of rows.
    - If you run this more than once on the same day, you'll run into an issue with overwriting the narrow maf table as it already exists. Be sure to rename the current narrow maf database under Tables in the test Synapse project and try again.
  python3 bin/input_to_database.py mutation --project_id syn7208886 --deleteOld --genie_annotation_pkg ../annotation-tools --createNewMafDatabase
  - Create a consortium release. Be sure to add the --test parameter. For consistency, the processingDate specified here should match the one used in the consortium_map for the TEST key in nf-genie.
  python3 bin/database_to_staging.py <processingDate> ../cbioportal TEST --test
  - Create a public release. Be sure to add the --test parameter. For consistency, the processingDate specified here should match the one used in the public_map for the TEST key in nf-genie.
  python3 bin/consortium_to_public.py <processingDate> ../cbioportal TEST --test
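The pipeline steps above always run in the same order against the test project, so they can be collected in one place. The sketch below only assembles the commands as argument lists and prints them; it does not run anything. The function name is hypothetical, and the default paths assume that the annotation-tools and cbioportal repos were cloned next to this one, as in the commands above.

```python
import shlex

TEST_PROJECT_ID = "syn7208886"  # the test project named in the steps above


def pipeline_commands(processing_date,
                      annotation_pkg="../annotation-tools",
                      cbioportal_path="../cbioportal"):
    """Return the test-project commands in the order they are described."""
    base = ["python3", "bin/input_to_database.py"]
    project = ["--project_id", TEST_PROJECT_ID]
    annotation = ["--genie_annotation_pkg", annotation_pkg]
    release_args = [processing_date, cbioportal_path, "TEST", "--test"]
    return [
        # Validation only: non-vcf files first, then all files.
        base + ["main"] + project + ["--onlyValidate"],
        base + ["mutation"] + project + ["--onlyValidate"] + annotation,
        # Processing: non-mutation files, then mutation files.
        base + ["main"] + project + ["--deleteOld"],
        base + ["mutation"] + project + ["--deleteOld"] + annotation
        + ["--createNewMafDatabase"],
        # Releases: consortium first, then public.
        ["python3", "bin/database_to_staging.py"] + release_args,
        ["python3", "bin/consortium_to_public.py"] + release_args,
    ]


if __name__ == "__main__":
    # "<processingDate>" is left as a placeholder; supply the real value.
    for command in pipeline_commands("<processingDate>"):
        print(shlex.join(command))
```

Keeping the project ID in a single constant makes it harder to point a local run at anything other than the test project by accident.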
- Navigate to your cloned repository on your computer/server.
- Make sure your develop branch is up to date with the Sage-Bionetworks/Genie develop branch.
cd Genie
git checkout develop
git pull
- Create a feature branch off the develop branch. If there is a GitHub/JIRA issue that you are addressing, name the branch after the issue with some more detail (like {GH|GEN}-123-add-some-new-feature).
git checkout -b GEN-123-new-feature
- At this point, you have only created the branch locally. You need to push it to the remote on Github.
git push -u origin GEN-123-new-feature
- Add your code changes and push them with a useful commit message.
git add changed_file.txt
git commit -m "Remove X parameter because it was unused"
git push
- Once you have completed all the steps above, in Github, create a pull request (PR) from your feature branch to the develop branch of Sage-Bionetworks/Genie.
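The branch naming suggestion above can be checked mechanically, for example in a local git hook. This is only a sketch: the function name is hypothetical, and the pattern assumes a strict reading of the example, namely `GH` or `GEN`, a dash, the issue number, then a lowercase description with hyphens between words. The repository itself may accept other names.

```python
import re

# Assumption: GH or GEN, a dash, the issue number, then one or more
# lowercase words or numbers, each preceded by a hyphen.
BRANCH_PATTERN = re.compile(r"(GH|GEN)-\d+(-[a-z0-9]+)+")


def is_issue_branch_name(name):
    """Return True when the whole name matches the assumed pattern."""
    return BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("GEN-123-new-feature", "my-branch"):
        print(candidate, is_issue_branch_name(candidate))
```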
See using docker for setting up the initial docker environment.
A docker build will be created for your feature branch every time you have an open PR on github and add the label run_integration_tests to it.
It is recommended to develop with docker. You can either write the code changes locally, push it to your remote and wait for docker to rebuild OR do the following:
- Make any code changes. These cannot be dependency changes, which would require a docker rebuild.
- Create a running docker container with the image that you pulled down or created earlier
docker run -d <docker_image_name> /bin/bash -c "while true; do sleep 1; done"
- Copy your code changes to the running container. docker cp takes a container ID or name (see docker ps), not an image name:
docker cp <folder or name of file> <container_id>:/root/Genie/<folder or name of files>
- Open an interactive shell in the running container:
docker exec -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <container_id> /bin/bash
- Run any commands or tests you need
Follow this section when modifying the Dockerfile:
- Have your synapse authentication token handy
- docker build -f Dockerfile -t <some_docker_image_name> .
- docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <some_docker_image_name>
- Run test code relevant to the Dockerfile changes to make sure the changes are present and working
- Once changes are tested, follow genie contributing guidelines for adding it to the repo
- Once deployed to main, make sure the CI/CD build successfully completed (our docker image gets automatically deployed via Github Actions CI/CD) here
- Check that your docker image got successfully deployed here
Currently, our Github Actions run unit tests from our test suite in /tests and run integration tests (each of the pipeline steps here) on the test pipeline.
These are all triggered by adding the Github label run_integration_tests on your open PR.
To trigger run_integration_tests:
- Add run_integration_tests for the first time when you just open your PR
- Remove the run_integration_tests label and re-add it
- Make any commit and push while the PR is still open
If you are developing with docker, docker images for your feature branch also get built via the run_integration_tests trigger, so check that your docker image got successfully deployed here.
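The trigger rules above can be summarised as a small decision function. This is a model for reasoning about the rules, not the actual workflow configuration. The function name is hypothetical, the action names borrow GitHub's `labeled` and `synchronize` pull request activity types, and it assumes that a push only triggers a run while the label is still on the PR.

```python
LABEL = "run_integration_tests"


def triggers_integration_tests(action, pr_labels, changed_label=None):
    """Decide whether a pull request event should start the tests.

    action: "labeled" when a label is added, "synchronize" when new
        commits are pushed to an open PR (assumed event names).
    pr_labels: labels on the PR after the event.
    changed_label: the label that was just added, if any.
    """
    if action == "labeled":
        # Covers adding the label the first time and re-adding it.
        return changed_label == LABEL
    if action == "synchronize":
        # Assumption: pushes count only while the label is present.
        return LABEL in pr_labels
    return False


if __name__ == "__main__":
    print(triggers_integration_tests("labeled", [LABEL], changed_label=LABEL))
```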
Unit tests in Python are also run automatically by Github Actions on any PR and are required to pass before merging.
Otherwise, if you want to add tests and run tests outside of the CI/CD, see how to run tests and general test development
See running pipeline steps here if you want to run the integration tests locally.
You can also run them in Nextflow via nf-genie.
The production pipeline is run on Nextflow Tower and the Nextflow workflow is captured in nf-genie. It is wise to create an ec2 via the Sage Bionetworks service catalog to work with the production data, because there is limited PHI in GENIE.
For technical details about our CI/CD, please see the github workflows README
