augur.io.sequences module

augur.io.sequences.find_feature_errors(features)

Find and return errors for features parsed from a GFF/GenBank reference file.

Parameters:

features (dict) – keys: feature names, values: Bio.SeqFeature.SeqFeature

Returns:

Error messages

Return type:

list of str

augur.io.sequences.get_biopython_format(augur_format)

Validate sequence file format and return the inferred Biopython format.

Return type:

str

augur.io.sequences.is_vcf(filename)

Convenience function to check whether a filename refers to a VCF file, based on its extension.

Examples

>>> is_vcf(None)
False
>>> is_vcf("./foo")
False
>>> is_vcf("./foo.vcf")
True
>>> is_vcf("./foo.vcf.GZ")
True
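
The examples above are consistent with a case-insensitive check of the filename extension. A minimal sketch of such a check follows; is_vcf_sketch is a hypothetical name, and this is not necessarily how augur implements is_vcf.

```python
def is_vcf_sketch(filename):
    """Return True if the filename ends in .vcf or .vcf.gz, ignoring case.

    Hypothetical stand-in for illustration; not augur's own implementation.
    """
    # None and empty strings cannot name a VCF file.
    if not filename:
        return False
    # str.endswith accepts a tuple of suffixes; lower() makes the match case-insensitive.
    return str(filename).lower().endswith((".vcf", ".vcf.gz"))
```

Only the name is inspected, so a path that does not exist on disk can still return True.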

augur.io.sequences.load_features(reference, feature_names=None)

Parse a GFF/GenBank reference file. See the docstrings for _read_gff and _read_genbank for details.

Parameters:

reference (str) – File path to GFF or GenBank (.gb) reference
feature_names (None or set or list (optional)) – Restrict the genes we read to those in the set/list

Returns:

features – keys: feature names, values: Bio.SeqFeature.SeqFeature. Note that feature names may not be equivalent to GenBank feature keys.

Return type:

dict

Raises:

AugurError – If the reference file doesn’t exist, is malformed / empty, or has errors

augur.io.sequences.read_sequence_ids(file, nthreads)

Get unique identifiers from a sequence file.

augur.io.sequences.read_sequences(*paths, format='fasta')

Read sequences from one or more paths.

Automatically infer compression mode (e.g., gzip) and return a stream of sequence records given the file format.

Parameters:

paths (Iterable[Union[str, PathLike]]) – One or more paths to sequence files.
format (str) – Format of input sequences. Either “fasta” or “genbank”.

Return type:

Iterator[SeqRecord]

Returns:

Sequence records from the given path(s).
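
To illustrate the idea of streaming records from several files while inferring compression, here is a standard-library-only sketch. The names open_maybe_gzip and iter_fasta are hypothetical, only gzip is handled, and records are yielded as plain (identifier, sequence) tuples rather than SeqRecord objects; read_sequences itself may work differently.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, decompressing it transparently if it is gzipped."""
    # gzip streams begin with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def iter_fasta(*paths):
    """Lazily yield (identifier, sequence) tuples from one or more FASTA files."""
    for path in paths:
        with open_maybe_gzip(path) as handle:
            name, chunks = None, []
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    # A new header closes the previous record, if any.
                    if name is not None:
                        yield name, "".join(chunks)
                    # Use the first whitespace-delimited token as the identifier.
                    header = line[1:].split()
                    name, chunks = (header[0] if header else ""), []
                elif name is not None:
                    chunks.append(line)
            # Emit the last record of the file.
            if name is not None:
                yield name, "".join(chunks)
```

Because iter_fasta is a generator, files are opened and parsed only as records are consumed, which keeps memory use low for large inputs.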

augur.io.sequences.read_single_sequence(path, format='fasta')

Read a single sequence from a path.

Automatically infers compression mode.

Parameters:

path (Union[str, PathLike]) – Path to a sequence file.
format (str) – Format of input file. Either “fasta” or “genbank”.

Return type:

SeqRecord

Returns:

A single sequence record from the given path.

augur.io.sequences.seqkit()

Internal helper for invoking SeqKit.

Unlike augur.merge.sqlite3(), this function is not a wrapper around subprocess.run. It is meant to be called without any parameters and only returns the location of the executable. This is due to differences in the way the two programs are invoked.

augur.io.sequences.subset_fasta(input_filename, output_filename, ids_file, nthreads)

augur.io.sequences.write_VCF_translation(prot_dict, vcf_file_name, ref_file_name)

Writes out a VCF-style file (which seems to be minimally handleable by vcftools and pyvcf) of the amino acid (AA) differences between the sequences and the reference. The format is similar to the one created/used by read_in_vcf, except that there is one of these dicts (with sequences, reference, positions) for each gene.

Also writes out a FASTA file of the reference alignment.

EBH 12 Dec 2017

augur.io.sequences.write_records_to_fasta(records, fasta, seq_id_field='strain', seq_field='sequence')

Write sequences from dict records to a FASTA file. Yields the records with the seq_field dropped so that they can be consumed downstream.

Parameters:

records (iterable of dict) – Iterable that yields dicts containing sequences
fasta (str) – Path to FASTA file
seq_id_field (str, optional) – Field name for the sequence identifier
seq_field (str, optional) – Field name for the genomic sequence

Yields:

dict – A copy of the record with seq_field dropped

Raises:

AugurError – When the sequence id field or sequence field does not exist in a record
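
The write-and-pass-through behavior described for write_records_to_fasta can be sketched with the standard library alone. The function name write_records_to_fasta_sketch is hypothetical, and ValueError stands in for AugurError so that the example runs without augur installed.

```python
def write_records_to_fasta_sketch(records, fasta, seq_id_field="strain", seq_field="sequence"):
    """Write dict records to a FASTA file and yield each record without its sequence.

    Hypothetical stand-in for illustration; not augur's own implementation.
    """
    with open(fasta, "w") as handle:
        for record in records:
            # ValueError is a stand-in here; the documented function raises AugurError.
            if seq_id_field not in record:
                raise ValueError(f"Record is missing the sequence id field {seq_id_field!r}.")
            if seq_field not in record:
                raise ValueError(f"Record is missing the sequence field {seq_field!r}.")
            handle.write(f">{record[seq_id_field]}\n{record[seq_field]}\n")
            # Yield a copy so the caller's original record is left untouched.
            yield {key: value for key, value in record.items() if key != seq_field}
```

In this sketch nothing is written until the generator is consumed, so a caller would typically iterate over the result (for example, to write the remaining fields as metadata).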

augur.io.sequences.write_sequences(sequences, path_or_buffer, format='fasta')

Write sequences to a given path in the given format.

Automatically infer compression mode (e.g., gzip) based on the path’s filename extension.

Parameters:

sequences (iterable of Bio.SeqRecord.SeqRecord) – A list-like collection of sequences to write
path_or_buffer (str or os.PathLike or io.StringIO) – A path to a file, or a buffer, to write the given sequences to in the given format.
format (str) – Format of output sequences, matching any of those supported by Biopython (e.g., “fasta”, “genbank”)

Returns:

Number of sequences written out to the given path.

Return type:

int

augur.io.sequences.write_vcf(input_filename, output_filename, dropped_samps)