augur.io

Interfaces for reading and writing data, also known as input/output (I/O).

augur.io.load_features(reference, feature_names=None)

Parse a GFF/GenBank reference file. See the docstrings for _read_gff and _read_genbank for details.

Parameters:
  • reference (str) – File path to GFF or GenBank (.gb) reference

  • feature_names (None or set or list (optional)) – Restrict the genes we read to those in the set/list

Returns:

features – keys: feature names, values: Bio.SeqFeature.SeqFeature. Note that feature names may not be equivalent to GenBank feature keys.

Return type:

dict

Raises:

AugurError – If the reference file doesn’t exist, is malformed / empty, or has errors

augur.io.open_file(path_or_buffer, mode='r', **kwargs)

Opens a given file path and returns the handle.

Transparently handles compressed inputs and outputs.

Parameters:
  • path_or_buffer – Name of the file to open or an existing IO buffer

  • mode (str) – Mode to open file (read or write)

Returns:

File handle object

Return type:

IO
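
The transparent compression handling described above can be illustrated with a minimal standard-library sketch. It is not augur's implementation: open_maybe_gzip and roundtrip are hypothetical helpers, and the sketch assumes that compression is selected only from a ".gz" filename suffix, whereas the real function may detect other formats or use other signals.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(path, mode="r", **kwargs):
    """Hypothetical stand-in for a compression-aware open().

    Assumption: compression is chosen from the ".gz" suffix only.
    """
    # gzip.open defaults to binary mode, so request text mode explicitly
    # unless the caller already specified binary or text.
    full_mode = mode if ("b" in mode or "t" in mode) else mode + "t"
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, full_mode, **kwargs)


def roundtrip(text, suffix):
    """Write text to a temporary file with the given suffix and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example" + suffix)
        with open_maybe_gzip(path, "w") as handle:
            handle.write(text)
        with open_maybe_gzip(path, "r") as handle:
            return handle.read()
```

The caller uses the same code for "example.txt" and "example.txt.gz"; only the filename decides whether the bytes on disk are compressed.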

augur.io.read_metadata(metadata_file, delimiters=(',', '\t'), columns=None, id_columns=('strain', 'name'), keep_id_as_column=False, chunk_size=None, dtype=None)

Read metadata from a given filename into a pandas DataFrame, or into an iterator of DataFrames when chunk_size is specified.

Parameters:
  • metadata_file (str) – Path to a metadata file to load.

  • delimiters (Sequence[str]) – List of possible delimiters to check for between columns in the metadata. Only one delimiter will be inferred.

  • columns (list[str] | None) – List of columns to read. If unspecified, read all columns.

  • id_columns (Sequence[str]) – List of possible id column names to check for, ordered by priority. Only one id column will be inferred.

  • keep_id_as_column (bool) – If true, keep the resolved id column as a column in addition to setting it as the DataFrame index.

  • chunk_size (int | None) – Size of chunks to stream from disk with an iterator instead of loading the entire input file into memory.

  • dtype (dict[str, Any] | str | None) – Data types to apply to columns in metadata. If unspecified, pandas data type inference will be used. See documentation for an argument of the same name to pandas.read_csv().

Raises:

KeyError – When the metadata file does not have any valid index columns.
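
The priority-ordered id column lookup described for id_columns can be sketched with the standard library alone. This is an illustration under stated assumptions, not the function's actual code: resolve_id_column and index_rows are hypothetical helpers, the delimiter is passed in rather than inferred, and a plain dict stands in for the DataFrame index.

```python
import csv
import io


def resolve_id_column(header, id_columns=("strain", "name")):
    """Return the first candidate id column that appears in the header.

    Hypothetical helper; it raises KeyError to mirror the "Raises" entry.
    """
    for candidate in id_columns:
        if candidate in header:
            return candidate
    raise KeyError(
        f"None of the possible id columns {id_columns!r} were found in {header!r}"
    )


def index_rows(text, delimiter="\t", id_columns=("strain", "name")):
    """Parse delimited text and key each row by its resolved id column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    id_column = resolve_id_column(reader.fieldnames, id_columns)
    return {row[id_column]: row for row in reader}
```

Because candidates are checked in order, a file that contains both "strain" and "name" columns is indexed by "strain" under the default id_columns.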

Examples

For standard use, request a metadata file and get a pandas DataFrame.

>>> read_metadata("tests/functional/filter/data/metadata.tsv").index.values[0]
'COL/FLR_00024/2015'

Requesting an index column that doesn’t exist should produce an error.

>>> read_metadata("tests/functional/filter/data/metadata.tsv", id_columns=("Virus name",))
Traceback (most recent call last):
  ...
Exception: None of the possible id columns ('Virus name') were found in the metadata's columns ('strain', 'virus', 'accession', 'date', 'region', 'country', 'division', 'city', 'db', 'segment', 'authors', 'url', 'title', 'journal', 'paper_url')

We also allow iterating through metadata in fixed chunk sizes.

>>> for chunk in read_metadata("tests/functional/filter/data/metadata.tsv", chunk_size=5):
...     print(chunk.shape)
...
(5, 14)
(5, 14)
(2, 14)

augur.io.read_sequences(*paths, format='fasta')

Read sequences from one or more paths.

Automatically infer compression mode (e.g., gzip) and return a stream of sequence records given the file format.

Parameters:
  • paths (Iterable[Union[str, PathLike]]) – One or more paths to sequence files.

  • format (str) – Format of input sequences. Either “fasta” or “genbank”.

Return type:

Iterator[SeqRecord]

Returns:

Sequence records from the given path(s).

augur.io.read_strains(*files, comment_char='#')

Reads strain names from one or more plain text files and returns the set of distinct strains.

Strain names can be commented with full-line or inline comments. For example, the following is a valid strain names file:

# this is a comment at the top of the file
strain1  # exclude strain1 because it isn't sequenced properly
strain2
  # this is an empty line that will be ignored.

Parameters:

files (iterable of str) – one or more names of text files with one strain name per line

Returns:

strain names from the given input files

Return type:

set
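
The comment rules above can be sketched as follows. This is a minimal illustration, assuming that everything from the first comment character to the end of the line is discarded and that lines left empty are skipped; parse_strain_lines is a hypothetical helper, not the function's actual code.

```python
def parse_strain_lines(lines, comment_char="#"):
    """Collect distinct strain names from an iterable of text lines."""
    strains = set()
    for line in lines:
        # Keep only the text before the comment character, then trim whitespace.
        name = line.split(comment_char, 1)[0].strip()
        if name:  # skip lines that held only a comment or whitespace
            strains.add(name)
    return strains


example = """\
# this is a comment at the top of the file
strain1  # exclude strain1 because it isn't sequenced properly
strain2
  # this is an empty line that will be ignored.
"""
strains = parse_strain_lines(example.splitlines())
```

Under these assumptions, the example file yields the two names strain1 and strain2; the inline comment does not affect whether strain1 is returned.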

augur.io.write_json(data, file, minify=None, minify_threshold_mb=5, indent=2)

Write data as JSON to the given file, creating parent directories if necessary.

Parameters:
  • data (dict) – data to write out to JSON

  • file – file path or handle to write to

  • minify (bool or None, optional) – Control output minification. True forces minified output, False forces non-minified output, None (default) uses auto-detection based on minify_threshold_mb. A truthy value in the environment variable AUGUR_MINIFY_JSON also forces minified output.

  • minify_threshold_mb (int or float, optional) – Threshold in megabytes above which output is automatically minified. Only applies when minify is None.

  • indent (int or None, optional) – JSON indentation level when not minifying.

Raises:

OSError
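
The interaction between minify and minify_threshold_mb can be sketched as below. The helpers choose_minify and dump_json are hypothetical, and two details are assumptions rather than documented behavior: size is measured as the byte length of the indented serialization, and one megabyte is taken as 10**6 bytes. The AUGUR_MINIFY_JSON environment variable is left out because its exact precedence is not specified here.

```python
import json


def choose_minify(data, minify=None, minify_threshold_mb=5, indent=2):
    """Decide whether to minify: explicit True/False wins, None auto-detects."""
    if minify is not None:
        return bool(minify)
    # Assumption: compare the size of the indented output to the threshold.
    indented = json.dumps(data, indent=indent)
    size_mb = len(indented.encode("utf-8")) / 10**6
    return size_mb > minify_threshold_mb


def dump_json(data, minify=None, minify_threshold_mb=5, indent=2):
    """Serialize data either compactly or with indentation."""
    if choose_minify(data, minify, minify_threshold_mb, indent):
        # Compact separators remove the whitespace json.dumps adds by default.
        return json.dumps(data, separators=(",", ":"))
    return json.dumps(data, indent=indent)
```

With the default arguments a small dictionary stays indented, while passing minify=True or lowering the threshold produces single-line output.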

augur.io.write_sequences(sequences, path_or_buffer, format='fasta')

Write sequences to a given path in the given format.

Automatically infer compression mode (e.g., gzip) based on the path’s filename extension.

Parameters:
  • sequences (iterable of Bio.SeqRecord.SeqRecord) – A list-like collection of sequences to write

  • path_or_buffer (str or os.PathLike or io.StringIO) – A path to a file, or an existing IO buffer, to which the given sequences are written in the given format.

  • format (str) – Format of output sequences matching any of those supported by Biopython (e.g., “fasta” or “genbank”)

Returns:

Number of sequences written out to the given path.

Return type:

int