augur subsample

Command line reference

Subsample sequences from an input dataset.

The input dataset can consist of a metadata file, a sequences file, or both.

See documentation page for details on configuration.

usage: augur subsample [-h] [--metadata FILE] [--sequences FILE]
                       [--sequence-index FILE] [--metadata-chunk-size N]
                       [--metadata-id-columns COLUMN [COLUMN ...]]
                       [--metadata-delimiters CHARACTER [CHARACTER ...]]
                       [--skip-checks] --config FILE
                       [--config-section KEY [KEY ...]]
                       [--search-paths DIR [DIR ...]] [--nthreads N]
                       [--seed N] [--output-metadata FILE]
                       [--output-sequences FILE] [--output-log OUTPUT_LOG]

Input options

options related to input files

--metadata

sequence metadata

--sequences

sequences in FASTA or VCF format. For large inputs, consider using --sequence-index in addition to this option.

--sequence-index

sequence composition report generated by augur index. If not provided, an index will be created on the fly. This should be generated from the same file as --sequences.

--metadata-chunk-size

maximum number of metadata records to read into memory at a time. Increasing this number can reduce run times at the cost of more memory used.

Default: 100000

--metadata-id-columns

names of possible metadata columns containing identifier information, ordered by priority. Only one ID column will be inferred.

Default: ('strain', 'name')

--metadata-delimiters

delimiters to accept when reading a metadata file. Only one delimiter will be inferred.

Default: (',', '\t')

--skip-checks

use this option to skip checking for duplicates in sequences and whether ids in metadata have a sequence entry. Can improve performance on large files. Note that this should only be used if you are sure there are no duplicate sequences or mismatched ids since they can lead to errors in downstream Augur commands.

Default: False

Configuration options

options related to configuration

--config

augur subsample config file. The expected config options must be defined at the top level, or within a specific section using --config-section.

--config-section

Use a section of the file given to --config by listing the keys leading to the section. Provide one or more keys. (default: use the entire file)

--search-paths, --search-path

One or more directories to search for relative filepaths specified in the config file. If a file exists in multiple directories, only the file from the first directory will be used. This can also be set via the environment variable ‘AUGUR_SEARCH_PATHS’. Specified directories are searched before the defaults, which are (1) the directory containing the config file and (2) the current working directory.

--nthreads

Number of CPUs/cores/threads/jobs to utilize at once. For augur subsample, this means the number of samples to run simultaneously. Individual samples are limited to a single thread. The final augur filter call can take advantage of multiple threads.

Default: 1

--seed

random number generator seed for reproducible outputs (with same input data).
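The lookup order described for --search-paths can be illustrated with a minimal sketch. The function name `resolve_path` and its arguments are hypothetical, and this is not Augur's actual implementation; it only mirrors the documented order (specified directories first, then the config file's directory, then the current working directory), with the first match winning.

```python
from pathlib import Path


def resolve_path(relative, search_paths, config_file):
    """Return the first existing match for a relative path, or None.

    Assumed order: user-specified directories, then the directory
    containing the config file, then the current working directory.
    """
    directories = [Path(p) for p in search_paths]
    directories.append(Path(config_file).resolve().parent)
    directories.append(Path.cwd())

    for directory in directories:
        candidate = directory / relative
        if candidate.exists():
            # Only the file from the first matching directory is used.
            return candidate
    return None
```

Under these assumptions, a file that exists in two of the specified directories resolves to the copy in whichever directory is listed first.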

Output options

options related to output files

--output-metadata

output metadata file

--output-sequences

output sequences file

--output-log

Tab-delimited file to debug sequence inclusion in samples. All sequences have a row with filter=filter_by_exclude_all. The sequences included in the output each have an additional row per sample that included it (there may be multiple). These rows have filter=force_include_strains with kwargs pointing to a temporary file that hints at the intermediate sample it came from.

Terminology

sample

This term can refer to either the process of creating a subset or the subset itself:

  1. Process: Selecting a subset of sequences from a dataset according to specific parameters for filtering and subsampling (e.g. minimum/maximum date, minimum/maximum sequence length, sample size).

    Example: Run the focal sample …

  2. Resulting subset: The set of sequences obtained from the process described in (1).

    Example: The contextual sample consisted of …

Configuration

The --config option expects a YAML-formatted configuration file. This section describes how the file should be structured.

defaults:
  # default sample options
samples:
  <sample 1>:
    # sample options
  <sample 2>:
    # sample options
  

Tip

Use --config-section to read from a configuration file that puts these options under a specific section.
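As a rough illustration of how a list of keys leads to a section, the sketch below walks a nested dictionary that stands in for a parsed YAML file. The function name `select_section` and the key names `builds` and `subsample` are hypothetical, and this is not Augur's actual implementation.

```python
def select_section(config, keys):
    """Follow each key in turn and return the nested section it leads to."""
    section = config
    for key in keys:
        if not isinstance(section, dict) or key not in section:
            raise KeyError(f"config section key not found: {key!r}")
        section = section[key]
    return section


# Hypothetical parsed config in which the subsample options are nested
# two levels below the top of the file.
parsed = {
    "builds": {
        "subsample": {
            "defaults": {"min_length": 9000},
            "samples": {"focal": {"max_sequences": 100}},
        }
    }
}

section = select_section(parsed, ["builds", "subsample"])
print(sorted(section))
```

For this hypothetical file, the equivalent command line would pass `--config-section builds subsample`. With no keys, the entire file is used.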

defaults

The defaults section is optional and allows you to specify common options that apply to all samples. This reduces repetition when multiple samples share the same criteria.

Options specified in the defaults section can be overridden by individual samples. If both defaults and a specific sample define the same option, the sample-specific value takes precedence.

Note that some options are only available at the sample level and cannot be specified in defaults.
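The precedence rule can be pictured as a dictionary merge in which sample-specific values win. This is a minimal sketch, not Augur's implementation: the function name `resolve_sample_options` is hypothetical, the sample names and option values are made up, and the sketch assumes that a sample's value replaces the default value wholesale.

```python
def resolve_sample_options(defaults, sample):
    """Return the effective options for one sample."""
    # Start from the shared defaults, then let sample-specific values win.
    return {**defaults, **sample}


# Hypothetical configuration, written as Python dictionaries.
defaults = {"min_date": "2020-01-01", "exclude_ambiguous_dates_by": "any"}
samples = {
    "focal": {"query": "country == 'Colombia'", "max_sequences": 200},
    "contextual": {"min_date": "2019-01-01", "max_sequences": 100},
}

effective = {
    name: resolve_sample_options(defaults, options)
    for name, options in samples.items()
}
for name, options in effective.items():
    print(name, options)
```

Here the `focal` sample inherits `min_date` from the defaults, while `contextual` overrides it with its own value.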

Option

Type

Description

exclude

string(s)

File(s) with list of strains to exclude.

exclude_all

boolean

Exclude all strains by default. Use this with the include arguments to select a specific subset of strains.

exclude_ambiguous_dates_by

one of:

  • any
  • day
  • month
  • year

Exclude ambiguous dates by day (e.g., 2020-09-XX), month (e.g., 2020-XX-XX), year (e.g., 200X-10-01), or any date fields. An ambiguous year makes the corresponding month and day ambiguous, too, even if those fields have unambiguous values (e.g., “201X-10-01”). Similarly, an ambiguous month makes the corresponding day ambiguous (e.g., “2010-XX-01”).

exclude_where

string(s)

Exclude sequences matching these conditions (e.g. “host=rat” or “host!=rat”). Multiple values are processed as OR (sequences matching any of the conditions are excluded), not AND.

include

string(s)

File(s) with list of strains to include regardless of priorities, subsampling, or absence of an entry in sequences.

include_where

string(s)

Include sequences with these values (e.g. “host=rat”). Multiple values are processed as OR (sequences having any of the values are included), not AND. This rule is applied last and ensures any strains matching these rules will be included regardless of priorities, subsampling, or absence of an entry in sequences.

min_date

string or integer

Minimal cutoff for date (inclusive). Supported formats:

  1. an Augur-style numeric date with the year as the integer part (e.g. 2020.42) or

  2. a date in ISO 8601 date format (i.e. YYYY-MM-DD) (e.g. ‘2020-06-04’) or

  3. a backwards-looking relative date in ISO 8601 duration format with optional P prefix (e.g. ‘1W’, ‘P1W’)

max_date

string or integer

Maximal cutoff for date (inclusive). Supported formats:

  1. an Augur-style numeric date with the year as the integer part (e.g. 2020.42) or

  2. a date in ISO 8601 date format (i.e. YYYY-MM-DD) (e.g. ‘2020-06-04’) or

  3. a backwards-looking relative date in ISO 8601 duration format with optional P prefix (e.g. ‘1W’, ‘P1W’)

min_length

integer

Minimum length of the sequences, only counting standard nucleotide characters A, C, G, or T (case-insensitive).

max_length

integer

Maximum length of the sequences, only counting standard nucleotide characters A, C, G, or T (case-insensitive).

non_nucleotide

boolean

Exclude sequences that contain illegal (non-nucleotide) characters.

query

string

Filter sequences by attribute. Uses Pandas DataFrame query syntax (e.g., "country == 'Colombia'" or "(country == 'USA' & (division == 'Washington'))").

query_columns

string(s)

Use alongside query to specify columns and data types in the format ‘column:type’, where type is one of (bool,float,int,str). Automatic type inference will be attempted on all unspecified columns used in the query. Example: region:str coverage:float.
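The ambiguity rules described for exclude_ambiguous_dates_by can be sketched in a few lines. This is an illustration only, not Augur's code: the function name `is_ambiguous` is hypothetical, and the sketch assumes every date has all three fields in YYYY-MM-DD form with X marking unknown digits.

```python
def is_ambiguous(date, by):
    """Return True if `date` is ambiguous at the resolution given by `by`.

    `by` is one of "any", "day", "month", or "year".
    """
    year, month, day = date.split("-")
    year_ambiguous = "X" in year
    # An ambiguous year makes the month ambiguous as well.
    month_ambiguous = year_ambiguous or "X" in month
    # An ambiguous month (or year) makes the day ambiguous as well.
    day_ambiguous = month_ambiguous or "X" in day
    return {
        "year": year_ambiguous,
        "month": month_ambiguous,
        "day": day_ambiguous,
        # "any" is true as soon as one field is ambiguous.
        "any": day_ambiguous,
    }[by]


for value in ["2020-09-XX", "2020-XX-XX", "201X-10-01", "2010-XX-01", "2020-06-04"]:
    print(value, {by: is_ambiguous(value, by) for by in ("year", "month", "day", "any")})
```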

samples

samples must contain at least one sample. Sample-specific options override any values set in the defaults section.

Option

Type

Description

exclude

string(s)

File(s) with list of strains to exclude.

exclude_all

boolean

Exclude all strains by default. Use this with the include arguments to select a specific subset of strains.

exclude_ambiguous_dates_by

one of:

  • any
  • day
  • month
  • year

Exclude ambiguous dates by day (e.g., 2020-09-XX), month (e.g., 2020-XX-XX), year (e.g., 200X-10-01), or any date fields. An ambiguous year makes the corresponding month and day ambiguous, too, even if those fields have unambiguous values (e.g., “201X-10-01”). Similarly, an ambiguous month makes the corresponding day ambiguous (e.g., “2010-XX-01”).

exclude_where

string(s)

Exclude sequences matching these conditions (e.g. “host=rat” or “host!=rat”). Multiple values are processed as OR (sequences matching any of the conditions are excluded), not AND.

include

string(s)

File(s) with list of strains to include regardless of priorities, subsampling, or absence of an entry in sequences.

include_where

string(s)

Include sequences with these values (e.g. “host=rat”). Multiple values are processed as OR (sequences having any of the values are included), not AND. This rule is applied last and ensures any strains matching these rules will be included regardless of priorities, subsampling, or absence of an entry in sequences.

min_date

string or integer

Minimal cutoff for date (inclusive). Supported formats:

  1. an Augur-style numeric date with the year as the integer part (e.g. 2020.42) or

  2. a date in ISO 8601 date format (i.e. YYYY-MM-DD) (e.g. ‘2020-06-04’) or

  3. a backwards-looking relative date in ISO 8601 duration format with optional P prefix (e.g. ‘1W’, ‘P1W’)

max_date

string or integer

Maximal cutoff for date (inclusive). Supported formats:

  1. an Augur-style numeric date with the year as the integer part (e.g. 2020.42) or

  2. a date in ISO 8601 date format (i.e. YYYY-MM-DD) (e.g. ‘2020-06-04’) or

  3. a backwards-looking relative date in ISO 8601 duration format with optional P prefix (e.g. ‘1W’, ‘P1W’)

min_length

integer

Minimum length of the sequences, only counting standard nucleotide characters A, C, G, or T (case-insensitive).

max_length

integer

Maximum length of the sequences, only counting standard nucleotide characters A, C, G, or T (case-insensitive).

non_nucleotide

boolean

Exclude sequences that contain illegal (non-nucleotide) characters.

query

string

Filter sequences by attribute. Uses Pandas DataFrame query syntax (e.g., "country == 'Colombia'" or "(country == 'USA' & (division == 'Washington'))").

query_columns

string(s)

Use alongside query to specify columns and data types in the format ‘column:type’, where type is one of (bool,float,int,str). Automatic type inference will be attempted on all unspecified columns used in the query. Example: region:str coverage:float.

group_by

string(s)

Grouping columns for subsampling. Notes:

  1. Grouping by [‘month’, ‘week’, ‘year’] is only supported when there is a ‘date’ column in the metadata.

  2. ‘week’ uses the ISO week numbering system, where a week starts on a Monday and ends on a Sunday.

  3. ‘month’ and ‘week’ grouping cannot be used together.

  4. Custom columns [‘month’, ‘week’, ‘year’] in the metadata are ignored for grouping. Please rename them if you want to use their values for grouping.

group_by_weights

string

TSV file defining weights for grouping. Requirements:

  1. Lines starting with ‘#’ are treated as comment lines.

  2. The first non-comment line must be a header row.

  3. There must be a numeric weight column (weights can take on any non-negative values).

  4. Other columns must be a subset of grouping columns, with combinations of values covering all combinations present in the metadata.

  5. This option only applies when grouping columns and a total sample size are provided.

  6. This option can only be used when probabilistic sampling is allowed.

Notes:

  1. Any grouping columns absent from this file will be given equal weighting across all values within groups defined by the other weighted columns.

  2. An entry with the value default under all columns will be treated as the default weight for specific groups present in the metadata but missing from the weights file. If there is no default weight and the metadata contains rows that are not covered by the given weights, augur filter will exit with an error.

probabilistic_sampling

boolean

Allow probabilistic sampling during subsampling. This is useful when there are more groups than requested sequences. This option only applies when a total sample size is provided.

sequences_per_group

integer

Select no more than this number of sequences per group.

max_sequences

integer

Select no more than this number of sequences (i.e. total sample size). Can be used without grouping columns.
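To make the group_by_weights file requirements more concrete, the sketch below parses a small weights file and looks up the weight for a group, falling back to the default entry. Everything here is illustrative: the file contents are made up, the column name `weight` is an assumption, the function names `read_weights` and `weight_for` are hypothetical, and the sketch assumes the file lists every grouping column. It is not how augur filter reads the file.

```python
import csv

# Hypothetical weights file for a sample grouped by "region".
# The column name "weight" is an assumption made for this sketch.
WEIGHTS_TSV = (
    "# comment lines start with '#'\n"
    "region\tweight\n"
    "Asia\t2\n"
    "Europe\t1\n"
    "default\t0.5\n"
)


def read_weights(text):
    """Parse the TSV text into a mapping of group values to weights."""
    # Skip comment lines; the first remaining line is the header row.
    lines = [line for line in text.splitlines() if not line.startswith("#")]
    weights = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        weight = float(row.pop("weight"))
        if weight < 0:
            raise ValueError("weights must be non-negative")
        # Key each weight by its grouping values, in sorted column order.
        weights[tuple(row[column] for column in sorted(row))] = weight
    return weights


def weight_for(weights, group):
    """Look up the weight for one group, falling back to the default entry."""
    key = tuple(group[column] for column in sorted(group))
    if key in weights:
        return weights[key]
    default_key = ("default",) * len(key)
    if default_key in weights:
        return weights[default_key]
    raise KeyError(f"no weight and no default for group {group}")


weights = read_weights(WEIGHTS_TSV)
print(weight_for(weights, {"region": "Asia"}))    # weight listed in the file
print(weight_for(weights, {"region": "Africa"}))  # falls back to the default row
```

Without the `default` row, the lookup for a group missing from the file raises an error, which parallels the documented behavior of exiting with an error when the metadata contains rows not covered by the weights.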

Implementation details

  • Configurations containing a single sample are run using a single call to augur filter.

  • Configurations containing multiple samples are run using multiple calls to augur filter.

    Each sample has its own call to augur filter, known as intermediate calls. These can run in parallel when --nthreads > 1.

    Each intermediate call uses --output-strains to write a text file containing the selected sequence ids for that sample.

    The output dataset is produced by a final augur filter call that uses the union of all sample id files to subset the input dataset.

  • CLI and YAML config options map closely to augur filter options.

    The following table shows the mapping between augur subsample and augur filter CLI options.

    augur subsample CLI option    augur filter CLI option
    --metadata                    --metadata
    --metadata-chunk-size         --metadata-chunk-size
    --metadata-delimiters         --metadata-delimiters
    --metadata-id-columns         --metadata-id-columns
    --sequences                   --sequences
    --sequence-index              --sequence-index
    --seed                        --subsample-seed
    --output-metadata             --output-metadata
    --output-sequences            --output-sequences
    --output-log                  --output-log
    --skip-checks                 --skip-checks

    The following table shows the mapping between augur subsample sample configuration options and augur filter CLI options.

    YAML config option            augur filter CLI option
    exclude                       --exclude
    exclude_all                   --exclude-all
    exclude_ambiguous_dates_by    --exclude-ambiguous-dates-by
    exclude_where                 --exclude-where
    include                       --include
    include_where                 --include-where
    min_date                      --min-date
    max_date                      --max-date
    min_length                    --min-length
    max_length                    --max-length
    non_nucleotide                --non-nucleotide
    query                         --query
    query_columns                 --query-columns
    group_by                      --group-by
    group_by_weights              --group-by-weights
    probabilistic_sampling        --probabilistic-sampling / --no-probabilistic-sampling
    sequences_per_group           --sequences-per-group
    max_sequences                 --subsample-max-sequences

    Note that the following augur filter options are not supported:

    • --priority

    • --output-group-by-sizes

    • --output-strains

    • --empty-output-reporting
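The multi-sample flow described in this section (one id list per sample, then a final subset using the union of those lists) can be pictured with a short sketch. It operates on in-memory data rather than calling augur filter, and the function name `combine_samples`, the sample names, and the `strain` id column are assumptions chosen for illustration.

```python
def combine_samples(ids_per_sample, records, id_column="strain"):
    """Keep every input record selected by at least one sample."""
    # Union of all per-sample id lists; an id picked by several samples
    # still appears only once in the output.
    selected = set().union(*ids_per_sample.values())
    return [record for record in records if record[id_column] in selected]


# Hypothetical ids written by two intermediate calls.
ids_per_sample = {
    "focal": ["A", "B"],
    "contextual": ["B", "C"],
}
# Hypothetical input dataset with four records.
records = [{"strain": name} for name in ["A", "B", "C", "D"]]

output = combine_samples(ids_per_sample, records)
print([record["strain"] for record in output])
```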