augur.io.sequences module
- augur.io.sequences.find_feature_errors(features)
Find and return errors for features parsed from a GFF/GenBank reference file.
- Parameters:
features (dict) – keys: feature names, values: Bio.SeqFeature.SeqFeature
- Returns:
Error messages
- Return type:
- augur.io.sequences.get_biopython_format(augur_format)
Validate sequence file format and return the inferred Biopython format.
- Return type:
- augur.io.sequences.is_vcf(filename)
Convenience function to check whether a file is a VCF file, based on its filename.
Examples
>>> is_vcf(None)
False
>>> is_vcf("./foo")
False
>>> is_vcf("./foo.vcf")
True
>>> is_vcf("./foo.vcf.GZ")
True
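The four doctest cases for is_vcf are consistent with a case-insensitive check on the filename suffix. The function below is a minimal sketch of that behaviour, not necessarily how augur implements it; the name is_vcf_sketch is hypothetical.

```python
def is_vcf_sketch(filename):
    """Return True when filename ends in .vcf or .vcf.gz, ignoring case."""
    # None and empty strings are falsy, so they short-circuit to False.
    if not filename:
        return False
    # str.endswith accepts a tuple of candidate suffixes.
    return filename.lower().endswith((".vcf", ".vcf.gz"))
```

Note that a check of this kind only inspects the name; it never opens the file, so a misnamed file would still pass.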
- augur.io.sequences.load_features(reference, feature_names=None)
Parse a GFF/GenBank reference file. See the docstrings for _read_gff and _read_genbank for details.
- Parameters:
- Returns:
features – keys: feature names, values: Bio.SeqFeature.SeqFeature. Note that feature names may not be equivalent to GenBank feature keys.
- Return type:
- Raises:
AugurError – If the reference file doesn’t exist, is malformed / empty, or has errors
- augur.io.sequences.read_sequence_ids(file, nthreads)
Get unique identifiers from a sequence file.
- augur.io.sequences.read_sequences(*paths, format='fasta')
Read sequences from one or more paths.
Automatically infers the compression mode (e.g., gzip) and returns a stream of sequence records given the file format.
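To illustrate what "a stream of sequence records" from several paths means for read_sequences, here is a minimal standard-library sketch. It is an assumption-laden stand-in, not augur's implementation: the helper read_fasta_records is hypothetical, it handles only uncompressed FASTA, and it yields plain (identifier, sequence) tuples rather than Biopython records.

```python
import os
import tempfile


def read_fasta_records(*paths):
    """Yield (identifier, sequence) tuples from each FASTA path, in order."""
    for path in paths:
        name, chunks = None, []
        with open(path, "rt") as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    # A new header closes the previous record, if any.
                    if name is not None:
                        yield name, "".join(chunks)
                    # The identifier is the first whitespace-delimited token.
                    header = line[1:].split()
                    name, chunks = (header[0] if header else ""), []
                elif line and name is not None:
                    chunks.append(line)
        # Emit the last record of this file.
        if name is not None:
            yield name, "".join(chunks)


# Demo: two small files are read as one continuous stream.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.fasta")
    second = os.path.join(tmp, "b.fasta")
    with open(first, "w") as handle:
        handle.write(">seq1 description\nACGT\nTT\n")
    with open(second, "w") as handle:
        handle.write(">seq2\nGGCC\n")
    records = list(read_fasta_records(first, second))
```

Because the function is a generator, records are produced one at a time and a large input never has to fit in memory at once.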
- augur.io.sequences.read_single_sequence(path, format='fasta')
Read a single sequence from a path.
Automatically infers compression mode.
- augur.io.sequences.seqkit()
Internal helper for invoking SeqKit.
Unlike augur.merge.sqlite3(), this function is not a wrapper around subprocess.run. It is meant to be called without any parameters and only returns the location of the executable. This is due to differences in the way the two programs are invoked.
- augur.io.sequences.subset_fasta(input_filename, output_filename, ids_file, nthreads)
- augur.io.sequences.write_VCF_translation(prot_dict, vcf_file_name, ref_file_name)
Writes a VCF-style file of the amino acid (AA) differences between the sequences and the reference. The output appears to be minimally compatible with vcftools and pyvcf. The format is similar to the one created and used by read_in_vcf, except that there is one such dict (with sequences, reference, positions) for each gene.
Also writes out a fasta of the reference alignment.
EBH 12 Dec 2017
- augur.io.sequences.write_records_to_fasta(records, fasta, seq_id_field='strain', seq_field='sequence')
Write sequences from dict records to a fasta file. Yields the records with the seq_field dropped so that they can be consumed downstream.
- Parameters:
- Yields:
dict – A copy of the record with seq_field dropped
- Raises:
AugurError – When the sequence id field or sequence field does not exist in a record
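The write-then-yield behaviour of write_records_to_fasta can be sketched as a generator. This is an illustration under stated assumptions, not augur's code: the sketch writes to an already open text handle rather than a path, and MissingFieldError is a hypothetical stand-in for AugurError.

```python
import io


class MissingFieldError(Exception):
    """Hypothetical stand-in for AugurError in this sketch."""


def write_records_to_fasta_sketch(records, fasta, seq_id_field="strain",
                                  seq_field="sequence"):
    """Write each record to the open handle, then yield it without seq_field."""
    for record in records:
        for field in (seq_id_field, seq_field):
            if field not in record:
                raise MissingFieldError(f"Record is missing field {field!r}")
        fasta.write(f">{record[seq_id_field]}\n{record[seq_field]}\n")
        # Build a copy so the caller's record is left unmodified.
        yield {key: value for key, value in record.items() if key != seq_field}


# Demo: nothing is written until the generator is consumed.
buffer = io.StringIO()
rows = [
    {"strain": "A", "sequence": "ACGT", "country": "X"},
    {"strain": "B", "sequence": "GGCC", "country": "Y"},
]
metadata = list(write_records_to_fasta_sketch(rows, buffer))
```

A generator of this shape is lazy: if the caller never iterates over the result, no sequences are written. Downstream code therefore has to consume the yielded records for the FASTA output to be complete.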
- augur.io.sequences.write_sequences(sequences, path_or_buffer, format='fasta')
Write sequences to a given path in the given format.
Automatically infers the compression mode (e.g., gzip) from the path’s filename extension.
- Parameters:
sequences (iterable of Bio.SeqRecord.SeqRecord) – A list-like collection of sequences to write
path_or_buffer (str or os.PathLike or io.StringIO) – A path to a file, or a buffer, to which the given sequences are written in the given format.
format (str) – Format in which to write the sequences, matching any of those supported by BioPython (e.g., “fasta”, “genbank”)
- Returns:
Number of sequences written out to the given path.
- Return type:
int
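As an illustration of how write_sequences might pick a compression mode from the filename extension and report the number of sequences written, here is a minimal standard-library sketch. The names OPENERS, open_for_writing, and write_fasta are hypothetical, the set of supported extensions is an assumption, and sequences are plain (identifier, sequence) tuples rather than Bio.SeqRecord.SeqRecord objects.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Assumed mapping from filename extension to an opener; augur may differ.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_for_writing(path):
    """Open path in text mode, compressing according to its extension."""
    extension = os.path.splitext(os.fspath(path))[1].lower()
    # Unknown extensions fall back to the built-in uncompressed open().
    opener = OPENERS.get(extension, open)
    return opener(path, "wt")


def write_fasta(sequences, path):
    """Write (identifier, sequence) tuples as FASTA and return the count."""
    count = 0
    with open_for_writing(path) as handle:
        for name, sequence in sequences:
            handle.write(f">{name}\n{sequence}\n")
            count += 1
    return count


# Demo: a .gz extension produces a gzip file that reads back unchanged.
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "out.fasta.gz")
    written = write_fasta([("A", "ACGT"), ("B", "GG")], out)
    with gzip.open(out, "rt") as handle:
        round_trip = handle.read()
```

Counting inside the write loop lets the function accept any iterable, including a generator, without needing its length up front.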
- augur.io.sequences.write_vcf(input_filename, output_filename, dropped_samps)