augur.subsample module

Subsample sequences from an input dataset.

The input dataset can consist of a metadata file, a sequences file, or both.

See documentation page for details on configuration.

augur.subsample.AugurFilterOption

Type for an augur filter command line option. Either a single option or a boolean pair of flags.

alias of str | Tuple[str, str | None]

augur.subsample.BooleanFlags

Type for a boolean pair of augur filter command line flags that configure the same option. When there is no second flag, absence of the first flag indicates the default behavior.

alias of Tuple[str, str | None]

augur.subsample.FINAL_CLI_OPTIONS: Dict[str, str | Tuple[str, str | None]] = {'output_log': '--output-log', 'output_metadata': '--output-metadata', 'output_sequences': '--output-sequences', 'skip_checks': ('--skip-checks', None)}

Mapping of argparse namespace variable name to augur filter option. These are sent to only the final augur filter call.

augur.subsample.FilterArgs

Augur filter arguments stored as a mapping from option to value.

alias of Dict[str, Any]

augur.subsample.GLOBAL_CLI_OPTIONS: Dict[str, str | Tuple[str, str | None]] = {'metadata': '--metadata', 'metadata_chunk_size': '--metadata-chunk-size', 'metadata_delimiters': '--metadata-delimiters', 'metadata_id_columns': '--metadata-id-columns', 'seed': '--subsample-seed', 'sequence_index': '--sequence-index', 'sequences': '--sequences'}

Mapping of argparse namespace variable name to augur filter option. These are sent to both intermediate and final augur filter calls.

augur.subsample.SAMPLE_CONFIG: Dict[str, str | Tuple[str, str | None]] = {'exclude': '--exclude', 'exclude_all': ('--exclude-all', None), 'exclude_ambiguous_dates_by': '--exclude-ambiguous-dates-by', 'exclude_where': '--exclude-where', 'group_by': '--group-by', 'group_by_weights': '--group-by-weights', 'include': '--include', 'include_where': '--include-where', 'max_date': '--max-date', 'max_length': '--max-length', 'max_sequences': '--subsample-max-sequences', 'min_date': '--min-date', 'min_length': '--min-length', 'non_nucleotide': ('--non-nucleotide', None), 'probabilistic_sampling': ('--probabilistic-sampling', '--no-probabilistic-sampling'), 'query': '--query', 'query_columns': '--query-columns', 'sequences_per_group': '--sequences-per-group'}

Mapping of YAML configuration key name to augur filter option. These are sent to only the intermediate augur filter calls.
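The following minimal sketch shows how a mapping of this shape could be turned into command line arguments. The helper `to_argv` and the excerpted `OPTIONS` dictionary are hypothetical names introduced for illustration; they are not part of augur.subsample, and the real translation logic may differ (for example in how list values are handled).

```python
from typing import Any, Dict, List, Optional, Tuple, Union

# Local re-statements of the documented type aliases.
BooleanFlags = Tuple[str, Optional[str]]
AugurFilterOption = Union[str, BooleanFlags]

# Small excerpt of the documented SAMPLE_CONFIG mapping.
OPTIONS: Dict[str, AugurFilterOption] = {
    "min_date": "--min-date",
    "group_by": "--group-by",
    "exclude_all": ("--exclude-all", None),
    "probabilistic_sampling": (
        "--probabilistic-sampling",
        "--no-probabilistic-sampling",
    ),
}


def to_argv(config: Dict[str, Any], options: Dict[str, AugurFilterOption]) -> List[str]:
    """Hypothetical helper: translate config keys into command line arguments."""
    argv: List[str] = []
    for key, value in config.items():
        option = options[key]
        if isinstance(option, tuple):
            # Boolean pair of flags: pick the flag that matches the value.
            true_flag, false_flag = option
            if value:
                argv.append(true_flag)
            elif false_flag is not None:
                argv.append(false_flag)
            # With no second flag, omitting the first flag gives the default.
        elif isinstance(value, (list, tuple)):
            # Assumption: list values become multiple values for one option.
            argv.extend([option, *map(str, value)])
        else:
            argv.extend([option, str(value)])
    return argv


example = to_argv(
    {
        "min_date": "2020-01-01",
        "group_by": ["country", "year"],
        "exclude_all": False,
        "probabilistic_sampling": False,
    },
    OPTIONS,
)
```

Here `exclude_all: False` contributes nothing because its pair has no second flag, while `probabilistic_sampling: False` selects the explicit negative flag.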

class augur.subsample.Sample(name, config, global_filter_args)

Bases: object

remove_output_strains()

Remove the augur filter arguments and temporary file.

run()

Run augur filter as a subprocess.

Notes:

  • A direct import of augur.filter in Python is not used because all samples would share the same sys.stderr, which causes interleaved messages when processes are run in parallel. This is also why _run_final_filter() isn’t repurposed for use here.

  • shell=True is not used because it requires additional logic to carefully escape values such as "--metadata-delimiters ,". This is also why run_shell_command() isn’t used here.

Return type:

None
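The two notes above can be illustrated with a generic sketch. It does not call augur; a small Python child process stands in for augur filter (an assumption made so the example runs anywhere). Passing the arguments as a list avoids shell escaping, and capturing output per process keeps each child's messages separate.

```python
import subprocess
import sys

# Values that would need careful quoting if passed through a shell.
args = ["--metadata-delimiters", ",", "\t"]

# Stand-in child process: prints the arguments it received.
child = [sys.executable, "-c", "import sys; print(repr(sys.argv[1:]))", *args]

# No shell=True: each list element reaches the child as one argument, unchanged.
# capture_output=True gives this process its own stdout/stderr buffers.
result = subprocess.run(child, capture_output=True, text=True, check=True)
received = result.stdout.strip()
```

The child sees exactly the three arguments in `args`, including the bare comma and the tab character.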

augur.subsample.get_parallelism(config_file, config_section=None, limit=None)

Compute the degree of parallelism (i.e., optimal value for --nthreads).

Inspects the subsample config file to return the degree of parallelism that should be used for --nthreads. Higher values allocate threads that will sit idle, while lower values allocate too few threads to fully use the available parallelism.

Parameters:
  • config_file (str) – Path to the subsample config file.

  • config_section (list[str] | None) – Optional list of keys to navigate to a specific section of the config file.

  • limit (int | None) – Optional upper bound for return value.

Returns:

Degree of parallelism.

Return type:

int
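As a rough sketch of the idea, the function below works on an already-parsed config dictionary. Everything in it is an assumption for illustration: the names `navigate` and `sketch_parallelism`, the presence of a `samples` key, and the rule that the degree of parallelism equals the number of samples. The actual computation in get_parallelism() may differ.

```python
from typing import Any, Dict, List, Optional


def navigate(config: Dict[str, Any], section: Optional[List[str]] = None) -> Dict[str, Any]:
    """Follow a list of keys down to a nested section of the config."""
    for key in section or []:
        config = config[key]
    return config


def sketch_parallelism(
    config: Dict[str, Any],
    section: Optional[List[str]] = None,
    limit: Optional[int] = None,
) -> int:
    """Hypothetical rule: one unit of parallelism per sample, capped by limit."""
    samples = navigate(config, section).get("samples", {})
    degree = max(1, len(samples))
    return degree if limit is None else min(degree, limit)


# Assumed config layout, nested under a "subsample" key for the section demo.
parsed = {
    "subsample": {
        "defaults": {"min_date": "2020-01-01"},
        "samples": {"early": {}, "recent": {}, "focal": {}},
    }
}
unlimited = sketch_parallelism(parsed, ["subsample"])
capped = sketch_parallelism(parsed, ["subsample"], limit=2)
```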

augur.subsample.get_referenced_files(config_file, config_section=None, search_paths=None)

Get the files referenced in a subsample config file.

Extracts and resolves all filepath values referenced in the config, including defaults and individual sample options.

Parameters:
  • config_file (str) – Path to the subsample config file.

  • config_section (Optional[List[str]]) – Optional list of keys to navigate to a specific section of the config file.

  • search_paths (Optional[List[str]]) – Optional list of directories to search for relative filepaths specified in the config file. If a file exists in multiple directories, only the file from the first directory will be used. This can also be set via the environment variable ‘AUGUR_SEARCH_PATHS’. Specified directories will be considered before the defaults, which are: (1) the directory containing the config file; (2) the current working directory.

Returns:

Resolved filepaths

Return type:

set
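The search order described for search_paths can be sketched as follows. The function `resolve` is a hypothetical stand-in, not augur's implementation; it ignores the AUGUR_SEARCH_PATHS environment variable, and returning None for a missing file is an assumption.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Optional


def resolve(path: str, config_file: str, search_paths: Optional[List[str]] = None) -> Optional[Path]:
    """Hypothetical resolver: return the first existing match, or None."""
    candidate = Path(path)
    if candidate.is_absolute():
        return candidate if candidate.exists() else None
    # Specified directories first, then the two documented defaults.
    directories = [
        *(search_paths or []),
        str(Path(config_file).parent),
        os.getcwd(),
    ]
    for directory in directories:
        resolved = Path(directory) / candidate
        if resolved.exists():
            return resolved
    return None


with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp, "shared")
    build = Path(tmp, "build")
    for directory in (shared, build):
        directory.mkdir()
        (directory / "include.txt").write_text("strain1\n")
    config = str(build / "config.yaml")

    # An explicit search path wins over the config file's directory.
    found_in_shared = resolve("include.txt", config, [str(shared)]) == shared / "include.txt"
    # Without search paths, the config file's directory is tried first.
    found_next_to_config = resolve("include.txt", config) == build / "include.txt"
```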

augur.subsample.register_parser(parent_subparsers)

Return type:

ArgumentParser

augur.subsample.run(args)

Run augur subsample.

This is implemented by calling augur filter once for each sample in the config (i.e. the intermediate calls), then one more time to combine the samples (i.e. the final call). It was inspired by several pathogen repos adopting a similar approach using Snakemake rules.

Notes on performance:

  • If multiple intermediate calls use sequence-based filters and --sequence-index is not set, each call will build its own sequence index, meaning the same work is done at least twice. A more efficient approach would be to add a preliminary step that builds the sequence index and passes it down to the intermediate calls. However, this complicates things and may not be worth it if sequence indexing is rewritten: <https://github.com/nextstrain/augur/issues/1846>

  • If multiple intermediate calls use the same default filters that significantly reduce the size of the initial input dataset, each call will go through the large input dataset and filter it with the same filters, meaning the same work is done at least twice. A more efficient approach would be to run the default options through an initial augur filter call. This would output a much smaller intermediate dataset that can be used by the intermediate calls. However, this complicates things and may not be worth it if a proper input reuse approach such as database/parquet file support is adopted: <https://github.com/nextstrain/augur/issues/1574>

Return type:

None
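The intermediate and final call structure can be pictured with a sketch that only builds argument lists and runs nothing. The helper `build_calls`, the strain file names, and the way the final call combines samples (via --exclude-all and --include on per-sample --output-strains files) are assumptions for illustration; the module's actual wiring may differ.

```python
from typing import Dict, List


def build_calls(
    global_args: List[str],
    samples: Dict[str, List[str]],
    final_args: List[str],
) -> List[List[str]]:
    """Hypothetical planner: one call per sample, then one combining call."""
    calls: List[List[str]] = []
    strain_files: List[str] = []
    for name, sample_args in samples.items():
        strains = f"{name}.strains.txt"  # hypothetical temporary file name
        strain_files.append(strains)
        # Intermediate call: global options plus this sample's options.
        calls.append(
            ["augur", "filter", *global_args, *sample_args, "--output-strains", strains]
        )
    # Final call: global options plus final-only options, keeping only the
    # strains selected by the intermediate calls (assumed mechanism).
    calls.append(
        ["augur", "filter", *global_args, "--exclude-all", "--include", *strain_files, *final_args]
    )
    return calls


calls = build_calls(
    ["--metadata", "metadata.tsv"],
    {
        "early": ["--max-date", "2023-12-31"],
        "recent": ["--min-date", "2024-01-01"],
    },
    ["--output-metadata", "subsampled.tsv"],
)
```

With two samples this yields three argument lists: two intermediate calls that could run in parallel, followed by the final call.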