Guide to helping develop the txsim-pipeline
This guide applies to any new method developed for segmentation, dot assignment, normalization, etc. Examples are pciSeq and Baysor. The easiest methods to add are those that exist as python packages on PyPI or as conda packages. Methods that are not written in python, or that are not distributed as packages, require a bit more effort to make them compatible with the pipeline.
Every method first requires a new snakemake rule. Information on the general syntax of snakemake can be found in the snakemake documentation. For this project each method has its own rule in the Snakefile, and the rule should have the same name as the method. The current convention for txsim is to run a python script (see the scripts folder) using a shell command with command line arguments, and then parse those arguments in the python script (see below). The script is run in an environment specified in envs.
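The argument-parsing half of this convention can be sketched as follows. This is a minimal illustration, not a script from the repository: the flag names (`-i`, `-o`, `-p`) and the idea of passing the parameters as a dictionary literal are assumptions made for the example.

```python
import argparse
import ast


def parse_args(argv=None):
    """Parse the command line arguments handed over by the shell command."""
    # All flag names below are hypothetical; use whatever the rule passes.
    parser = argparse.ArgumentParser(description="Run a hypothetical method")
    parser.add_argument("-i", "--input", required=True, help="Path to the input file")
    parser.add_argument("-o", "--output", required=True, help="Path to the output file")
    parser.add_argument("-p", "--params", default="{}",
                        help="Method parameters as a dictionary literal")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # ast.literal_eval turns a string such as "{'a': 1}" into a dictionary
    # without executing arbitrary code.
    params = ast.literal_eval(args.params)
    return args.input, args.output, params


# Example call with an explicit argument list instead of sys.argv
example = main(["-i", "in.csv", "-o", "out.csv",
                "-p", "{'parameter_1': 0, 'parameter_2': True}"])
print(example)
```

In a real script, `main()` would be called without arguments so that `argparse` reads `sys.argv`, and the returned values would be handed to the method itself.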
All methods should have their default parameters specified in defaults.yaml in configs. These defaults are not passed into the method; they are only used to fill in the readable params dictionary saved after a run, which is useful when sorting outputs.
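One way to picture the role of the defaults is the sketch below. It assumes that the readable parameter record is simply the defaults overlaid with whatever the user set; the function name `readable_params` is hypothetical, and the pipeline's actual logic may differ.

```python
def readable_params(defaults, user_params):
    """Overlay user-set values on the defaults, for record keeping only."""
    merged = dict(defaults)            # copy so the defaults stay untouched
    merged.update(user_params or {})   # user values win over defaults
    return merged


# Hypothetical defaults for one method, and one value set in the config
defaults = {"parameter_1": 0, "parameter_2": True, "parameter_3": 2.5}
record = readable_params(defaults, {"parameter_3": 1.0})
print(record)
```

The merged dictionary is only a record of the run. The method itself still receives just the parameters given in the config.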
Each method may have its own way of passing in parameters. However, to work with the pipeline, the parameters must be specified in config.yaml in configs. The recommended hierarchy is:
    batch:
      dataset: [dataset_1]
      workflow: [group_1, group_2, group_3]
      group_1:
        method_1:
          parameter_1: 0
          parameter_2: True
          parameter_3: 2.5
        method_2:
          parameter_1: False
      group_2:
        method_3:
      group_3:
        method_4:
      group_3_params:
        parameter_1: True
        parameter_2: 5
    ...
With this structure, the get_params function within the Snakefile will return a method-specific parameter dictionary (using 'p' as a third argument) or a group parameter for that method (using the parameter name as a third argument). For instance, get_params would return
(dict) {parameter_1: 0, parameter_2: True, parameter_3: 2.5}
for method_1 using the 'p' argument. Likewise, get_params would return 5 using the parameter_2 argument for method_4 since those params apply to the whole group. This function is very useful for inputting parameters into a snakemake rule.
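The lookup behaviour described above can be sketched in plain Python. This is not the function from the Snakefile: the name `get_params_sketch`, its first two arguments, and the assumption that `group_3_params` sits next to `group_3` inside the batch are all illustrative choices. Only the meaning of the third argument follows the text.

```python
# The example hierarchy, written as the dictionary a YAML parser would produce
# (assumed nesting; methods without parameters parse as None).
config = {
    "batch": {
        "dataset": ["dataset_1"],
        "workflow": ["group_1", "group_2", "group_3"],
        "group_1": {
            "method_1": {"parameter_1": 0, "parameter_2": True, "parameter_3": 2.5},
            "method_2": {"parameter_1": False},
        },
        "group_2": {"method_3": None},
        "group_3": {"method_4": None},
        "group_3_params": {"parameter_1": True, "parameter_2": 5},
    }
}


def get_params_sketch(batch, method, key):
    """Return the method's parameter dict for key 'p', else a group parameter."""
    for group in batch["workflow"]:
        methods = batch.get(group) or {}
        if method in methods:
            if key == "p":
                # Method-specific parameters; empty if none were given
                return methods[method] or {}
            # Group-wide parameters live under "<group>_params"
            group_params = batch.get(f"{group}_params") or {}
            return group_params.get(key)
    raise KeyError(f"{method} is not part of any group in the workflow")


print(get_params_sketch(config["batch"], "method_1", "p"))
print(get_params_sketch(config["batch"], "method_4", "parameter_2"))
```

Inside a rule, such a lookup would typically be placed in the `params` section so that the values reach the shell command.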
Currently, each method has its own python script, or a few related methods share one script. Within the script, the method is run in one of three ways: through a wrapper in the txsim package, through an imported library, or through the command line interface. The last option is used for non-python methods (e.g. Baysor). After the method is run, the outputs must be saved explicitly from the python script; specifying the output in the Snakefile will not save the file automatically. The naming conventions for input and output are specified in the README.md. For a new method, it is recommended to add a new python script corresponding to that method.
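The point about saving outputs can be illustrated with a small sketch. The method here (`toy_method`) is a stand-in, and the column names and file name are made up for the example; a real script would call the txsim wrapper or library and follow the naming conventions from the README.md.

```python
import csv
import tempfile
from pathlib import Path


def toy_method(rows, threshold):
    """Stand-in for a real method: keep rows whose score exceeds threshold."""
    return [row for row in rows if row["score"] > threshold]


def save_output(rows, output_path):
    """Write the result to the path that the snakemake rule lists as output."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["gene", "score"])
        writer.writeheader()
        writer.writerows(rows)
    return output_path


# Demonstration in a temporary directory (hypothetical file name)
rows = [{"gene": "A", "score": 0.1}, {"gene": "B", "score": 0.9}]
with tempfile.TemporaryDirectory() as tmp:
    out_file = save_output(toy_method(rows, threshold=0.5), Path(tmp) / "result.csv")
    written = out_file.read_text().splitlines()
print(written)
```

If the script never writes the file, snakemake will report the declared output as missing once the rule finishes, so the explicit save step is required.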
Many methods are written as python packages on PyPI or conda. To utilize a new package/library, simply add a new conda environment in the envs folder, and use this environment in the snakemake rule. If adding a new package to integrate into the main txsim library, be sure to update the dependencies in pyproject.toml accordingly.
There are many ways to add a local library or repository, including installing the package in editable mode with pip's -e flag. The exact way of integrating a local library will depend on the library itself.
Adding a non-python method may be more difficult. Many R methods can be installed from conda, or even run from python. However, some methods may require the creation of a new container (e.g. a Docker container), which can be run through snakemake using Singularity (specified by the container directive in the rule). The exact implementation will vary greatly from method to method, but Baysor is an example of such a method. See the snakemake documentation for more info on containers.
The pipeline itself mainly contains scripts and files specifying the workflow organization. In order to edit specific functions of txsim (such as how counts are normalized), directly edit the txsim package. Adding datasets is described in the main README. For the general structure of the repository:
- The `scripts` folder contains python scripts which run individual methods and parse command line arguments
- The `envs` folder contains environments used by `snakemake` to run the scripts
- The `configs` folder contains config files and defaults
- `TxsimConfig.py` contains a ParsedConfig class which parses the config file, establishes the method combinations, and saves the parameters as readable csv files.
- `Snakefile` specifies the snakemake rules which are used to determine which scripts to run and which parameters to use.
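To give a feel for what "establishing the method combinations" means, here is a minimal sketch that enumerates one method per workflow group. It is not the ParsedConfig implementation: the function name `method_combinations` and the dictionary layout are assumptions based on the config hierarchy described earlier in this guide.

```python
from itertools import product


def method_combinations(batch):
    """List every way of picking one method from each group in the workflow."""
    groups = batch["workflow"]
    # The keys of each group mapping are the method names in that group
    method_lists = [list(batch[group]) for group in groups]
    return [dict(zip(groups, combo)) for combo in product(*method_lists)]


# Hypothetical batch with two methods in the first group
batch = {
    "workflow": ["group_1", "group_2", "group_3"],
    "group_1": {"method_1": {"parameter_1": 0}, "method_2": {"parameter_1": False}},
    "group_2": {"method_3": None},
    "group_3": {"method_4": None},
}

combos = method_combinations(batch)
for combo in combos:
    print(combo)
```

With two methods in `group_1` and one in each of the other groups, this yields two combinations, each of which would correspond to one run of the pipeline.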