augur filter
Filter and subsample a sequence set.
SeqKit is used behind the scenes to handle FASTA files, but this should be considered an implementation detail that may change in the future. The CLI program seqkit must be available. If it’s not on PATH (or you want to use a version different from what’s on PATH), set the SEQKIT environment variable to the path of the desired seqkit executable.
VCFtools is used behind the scenes to handle VCF files, but this should be considered an implementation detail that may change in the future. The CLI program vcftools must be available on PATH.
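The lookup order described above for seqkit can be pictured with a short Python sketch. This is not Augur's code: the helper name resolve_executable is hypothetical, and only the SEQKIT variable and the program names seqkit and vcftools come from the text.

```python
import os
import shutil
from typing import Optional


def resolve_executable(env_var: str, program: str) -> Optional[str]:
    """Prefer the path in env_var when it is set, else search PATH for program.

    Hypothetical helper for illustration; returns None when nothing is found.
    """
    override = os.environ.get(env_var)
    if override:
        return override
    return shutil.which(program)


# seqkit honours the SEQKIT override; vcftools is only looked up on PATH.
seqkit_path = resolve_executable("SEQKIT", "seqkit")
vcftools_path = shutil.which("vcftools")
```

Checking both values for None before running a large job gives an early hint that a required tool is missing.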
usage: augur filter [-h] --metadata FILE [--sequences SEQUENCES]
[--sequence-index SEQUENCE_INDEX]
[--metadata-chunk-size METADATA_CHUNK_SIZE]
[--metadata-id-columns METADATA_ID_COLUMNS [METADATA_ID_COLUMNS ...]]
[--metadata-delimiters METADATA_DELIMITERS [METADATA_DELIMITERS ...]]
[--skip-checks] [--query QUERY]
[--query-columns QUERY_COLUMNS [QUERY_COLUMNS ...]]
[--min-date MIN_DATE] [--max-date MAX_DATE]
[--exclude-ambiguous-dates-by {any,day,month,year}]
[--exclude EXCLUDE [EXCLUDE ...]]
[--exclude-where EXCLUDE_WHERE [EXCLUDE_WHERE ...]]
[--exclude-all] [--include INCLUDE [INCLUDE ...]]
[--include-where INCLUDE_WHERE [INCLUDE_WHERE ...]]
[--min-length MIN_LENGTH] [--max-length MAX_LENGTH]
[--non-nucleotide] [--group-by GROUP_BY [GROUP_BY ...]]
[--sequences-per-group SEQUENCES_PER_GROUP | --subsample-max-sequences SUBSAMPLE_MAX_SEQUENCES]
[--probabilistic-sampling | --no-probabilistic-sampling]
[--group-by-weights FILE] [--priority PRIORITY]
[--subsample-seed SUBSAMPLE_SEED]
[--output-sequences OUTPUT_SEQUENCES]
[--output-metadata OUTPUT_METADATA]
[--output-strains OUTPUT_STRAINS]
[--output-log OUTPUT_LOG]
[--output-group-by-sizes OUTPUT_GROUP_BY_SIZES]
[--empty-output-reporting {error,warn,silent}]
[--nthreads N] [--output FILE] [-o FILE]
inputs
metadata and sequences to be filtered
- --metadata
sequence metadata
- --sequences, -s
sequences in FASTA or VCF format. For large inputs, consider using --sequence-index in addition to this option.
- --sequence-index
sequence composition report generated by augur index. If not provided, an index will be created on the fly. This should be generated from the same file as --sequences.
- --metadata-chunk-size
maximum number of metadata records to read into memory at a time. Increasing this number can speed up filtering at the cost of more memory used.
Default:
100000
- --metadata-id-columns
names of possible metadata columns containing identifier information, ordered by priority. Only one ID column will be inferred.
Default:
('strain', 'name')
- --metadata-delimiters
delimiters to accept when reading a metadata file. Only one delimiter will be inferred.
Default:
(',', '\t')
- --skip-checks
use this option to skip checking for duplicates in sequences and whether ids in metadata have a sequence entry. Can improve performance on large files. Note that this should only be used if you are sure there are no duplicate sequences or mismatched ids since they can lead to errors in downstream Augur commands.
Default:
False
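The phrase "ordered by priority" for --metadata-id-columns can be read as "the first candidate that exists in the metadata header wins". A minimal Python sketch of that reading follows. The function name infer_id_column is hypothetical, and what Augur does when no candidate matches is not shown here.

```python
from typing import Optional, Sequence


def infer_id_column(
    header: Sequence[str],
    candidates: Sequence[str] = ("strain", "name"),
) -> Optional[str]:
    """Return the first candidate column present in the header, else None.

    Candidates are tried in the order given, so earlier names take priority
    even when several of them appear in the file.
    """
    for candidate in candidates:
        if candidate in header:
            return candidate
    return None


# Both columns exist, but "strain" is listed first in the default candidates.
print(infer_id_column(["name", "strain", "date"]))
```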
metadata filters
filters to apply to metadata
- --query
Filter sequences by attribute. Uses Pandas DataFrame query syntax. (e.g., “country == ‘Colombia’” or “(country == ‘USA’ & (division == ‘Washington’))”)
- --query-columns
Use alongside query to specify columns and data types in the format ‘column:type’, where type is one of (bool,float,int,str). Automatic type inference will be attempted on all unspecified columns used in the query. Example: region:str coverage:float.
- --min-date
Minimal cutoff for date (inclusive). Supported formats:
an Augur-style numeric date with the year as the integer part (e.g. 2020.42) or
a date in ISO 8601 date format (i.e. YYYY-MM-DD) (e.g. ‘2020-06-04’) or
a backwards-looking relative date in ISO 8601 duration format with optional P prefix (e.g. ‘1W’, ‘P1W’)
- --max-date
Maximal cutoff for date (inclusive). Supported formats:
an Augur-style numeric date with the year as the integer part (e.g. 2020.42) or
a date in ISO 8601 date format (i.e. YYYY-MM-DD) (e.g. ‘2020-06-04’) or
a backwards-looking relative date in ISO 8601 duration format with optional P prefix (e.g. ‘1W’, ‘P1W’)
- --exclude-ambiguous-dates-by
Possible choices: any, day, month, year
Exclude ambiguous dates by day (e.g., 2020-09-XX), month (e.g., 2020-XX-XX), year (e.g., 200X-10-01), or any date fields. An ambiguous year makes the corresponding month and day ambiguous, too, even if those fields have unambiguous values (e.g., “201X-10-01”). Similarly, an ambiguous month makes the corresponding day ambiguous (e.g., “2010-XX-01”).
- --exclude
File(s) with list of strains to exclude.
- --exclude-where
Exclude sequences matching these conditions. Ex: “host=rat” or “host!=rat”. Multiple values are processed as OR (matching any of those specified will be excluded), not AND.
- --exclude-all
Exclude all strains by default. Use this with the include arguments to select a specific subset of strains.
Default:
False
- --include
File(s) with list of strains to include regardless of priorities, subsampling, or absence of an entry in sequences.
- --include-where
Include sequences with these values. ex: host=rat. Multiple values are processed as OR (having any of those specified will be included), not AND. This rule is applied last and ensures any strains matching these rules will be included regardless of priorities, subsampling, or absence of an entry in sequences.
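The cascading rule described for --exclude-ambiguous-dates-by (an ambiguous year also makes the month and day ambiguous, and an ambiguous month makes the day ambiguous) can be sketched in Python. This is an illustration rather than Augur's implementation; it assumes complete YYYY-MM-DD strings that use X as the placeholder, and the name is_ambiguous is hypothetical.

```python
def is_ambiguous(date: str, by: str) -> bool:
    """Report whether a YYYY-MM-DD date is ambiguous at the given level.

    by is one of "any", "day", "month", "year". Ambiguity cascades downward:
    year -> month -> day.
    """
    year, month, day = date.split("-")
    year_ambiguous = "X" in year
    month_ambiguous = year_ambiguous or "X" in month
    day_ambiguous = month_ambiguous or "X" in day
    levels = {
        "year": year_ambiguous,
        "month": month_ambiguous,
        "day": day_ambiguous,
        # Because of the cascade, "any" is equivalent to the day-level check.
        "any": day_ambiguous,
    }
    return levels[by]


# An ambiguous year makes the month ambiguous even though "10" is explicit.
print(is_ambiguous("201X-10-01", "month"))
```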
sequence filters
filters to apply to sequence data
- --min-length
Minimal length of the sequences, only counting standard nucleotide characters A, C, G, or T (case-insensitive).
- --max-length
Maximum length of the sequences, only counting standard nucleotide characters A, C, G, or T (case-insensitive).
- --non-nucleotide
Exclude sequences that contain illegal characters.
Default:
False
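The length filters count only A, C, G, and T (case-insensitive), so gaps and ambiguity codes such as N do not contribute. A small Python sketch of that counting rule is shown below. The function names are hypothetical, and treating both bounds as inclusive is an assumption, since the text above does not say.

```python
from typing import Optional


def acgt_length(sequence: str) -> int:
    """Count only standard nucleotide characters, ignoring case."""
    return sum(1 for base in sequence.upper() if base in "ACGT")


def passes_length_filters(
    sequence: str,
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
) -> bool:
    """Apply optional bounds to the ACGT-only length (bounds assumed inclusive)."""
    length = acgt_length(sequence)
    if min_length is not None and length < min_length:
        return False
    if max_length is not None and length > max_length:
        return False
    return True


# Eight characters in total, but only four of them count toward the length.
print(acgt_length("ACGTNNNN"))
```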
subsampling
options to subsample filtered data
- --group-by
Grouping columns for subsampling. Notes:
Grouping by [‘month’, ‘week’, ‘year’] is only supported when there is a ‘date’ column in the metadata.
‘week’ uses the ISO week numbering system, where a week starts on a Monday and ends on a Sunday.
‘month’ and ‘week’ grouping cannot be used together.
Custom columns [‘month’, ‘week’, ‘year’] in the metadata are ignored for grouping. Please rename them if you want to use their values for grouping.
Default:
[]
- --sequences-per-group
Select no more than this number of sequences per category.
- --subsample-max-sequences
Select no more than this number of sequences (i.e. total sample size). Can be used without grouping columns.
- --probabilistic-sampling
Allow probabilistic sampling during subsampling. This is useful when there are more groups than requested sequences. This option only applies when a total sample size is provided.
Default:
True
- --no-probabilistic-sampling
Default:
True
- --group-by-weights
TSV file defining weights for grouping. Requirements:
Lines starting with ‘#’ are treated as comment lines.
The first non-comment line must be a header row.
There must be a numeric weight column (weights can take on any non-negative values).
Other columns must be a subset of grouping columns, with combinations of values covering all combinations present in the metadata.
This option only applies when grouping columns and a total sample size are provided.
This option can only be used when probabilistic sampling is allowed.
Notes:
Any grouping columns absent from this file will be given equal weighting across all values within groups defined by the other weighted columns.
An entry with the value default under all columns will be treated as the default weight for specific groups present in the metadata but missing from the weights file. If there is no default weight and the metadata contains rows that are not covered by the given weights, augur filter will exit with an error.
- --priority
tab-delimited file with list of priority scores for strains (e.g., “<strain>\t<priority>”) and no header.
When scores are provided, Augur converts scores to floating point values, sorts strains within each subsampling group from highest to lowest priority, and selects the top N strains per group where N is the calculated or requested number of strains per group. Higher numbers indicate higher priority. Since priorities represent relative values between strains, these values can be arbitrary.
- --subsample-seed
random number generator seed to allow reproducible subsampling (with same input data).
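The --priority behaviour described above (convert scores to floats, sort each subsampling group from highest to lowest, keep the top N) can be sketched in Python. This is an illustration under stated assumptions and not Augur's code: the function names are hypothetical, the strain names and scores are made up, and ranking strains with no score last is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def parse_priorities(text: str) -> Dict[str, float]:
    """Parse headerless "<strain>\t<priority>" lines into float scores."""
    scores = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        strain, score = line.split("\t")
        scores[strain] = float(score)
    return scores


def top_n_per_group(
    groups: Dict[str, str], priorities: Dict[str, float], n: int
) -> Dict[str, List[str]]:
    """Keep the n highest-priority strains within each group."""
    members = defaultdict(list)
    for strain, group in groups.items():
        members[group].append(strain)
    selected = {}
    for group, strains in members.items():
        # Assumption: strains without a score sort after all scored strains.
        ranked = sorted(
            strains,
            key=lambda s: priorities.get(s, float("-inf")),
            reverse=True,
        )
        selected[group] = ranked[:n]
    return selected


# Made-up example data: three strains from 2020 and one from 2021.
priorities = parse_priorities("A\t3\nB\t10\nC\t-1.5\nD\t2")
groups = {"A": "2020", "B": "2020", "C": "2020", "D": "2021"}
print(top_n_per_group(groups, priorities, n=2))
```

Only the relative order of scores within a group matters in this sketch, which matches the note that priority values can be arbitrary.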
outputs
options related to outputs; at least one of the possible representations of filtered data (--output-sequences, --output-metadata, --output-strains) is required
- --output-sequences
filtered sequences in FASTA format
- --output-metadata
metadata for strains that passed filters
- --output-strains
list of strains that passed filters (no header)
- --output-log
tab-delimited file with one row for each filtered strain and the reason it was filtered. Keyword arguments used for a given filter are reported in JSON format in a kwargs column.
- --output-group-by-sizes
tab-delimited file with one row per group and its target size.
- --empty-output-reporting
Possible choices: error, warn, silent
How should empty outputs be reported when no strains pass filtering and/or subsampling.
Default:
error
other
other options
- --nthreads
Number of CPUs/cores/threads/jobs to utilize at once.
Default:
1
deprecated
options to be removed in a future major version
- --output
alias to --output-sequences
- -o
alias to --output-sequences