How to extract/convert GFF3 CDS sequences to multi-FASTA

https://bioinformatics.stackexchange.com/questions/2341/how-to-extract-convert-gff3-cds-sequences-to-multifasta

Using Python and the bcbio-gff parser (imported as BCBio.GFF), which mimics Biopython's SeqIO parsers:

from BCBio import GFF

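# Read the GFF; this assumes the file carries its own sequence
# (e.g. an embedded ##FASTA section) to extract from
for rec in GFF.parse('my_file.gff'):
    # only focus on the CDSs (this loop sees top-level features only)
    for feat in filter(lambda x: x.type == 'CDS',
                       rec.features):
        # extract the locus tag
        locus_tag = feat.qualifiers.get('locus_tag',
                                        ['unspecified'])[0]
        # extract the sequence; do not reuse the loop variable here,
        # or the next feature would be cut from the wrong object
        dna_seq = str(feat.extract(rec.seq))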
        # simply print the sequence in fasta format
        print('>%s\n%s' % (locus_tag, dna_seq))
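In many GFF3 files a coding sequence is split across several CDS rows that hang off an mRNA through the Parent attribute, so the pieces have to be joined before printing. Below is a dependency-free sketch of that idea. It is not part of the answer above: the function names, the choice to group by Parent (falling back to ID, then locus_tag), and the toy data are all assumptions made for illustration, and it ignores phase and multi-parent features.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(dna):
    """Return the reverse complement of a plain DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def parse_attributes(field):
    """Turn a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def cds_sequences(gff_lines, genome):
    """Yield (name, sequence) for each group of CDS rows.

    genome is a dict mapping sequence IDs to sequence strings.
    Grouping by Parent is an assumption; adapt it to your file.
    """
    groups = defaultdict(list)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section ends the annotation
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        name = (attrs.get("Parent") or attrs.get("ID")
                or attrs.get("locus_tag", "unspecified"))
        # GFF3 coordinates are 1-based and inclusive
        groups[(cols[0], cols[6], name)].append((int(cols[3]), int(cols[4])))

    for (seqid, strand, name), segments in groups.items():
        segments.sort()
        seq = "".join(genome[seqid][start - 1:end] for start, end in segments)
        if strand == "-":
            # join in genomic order first, then flip the whole thing
            seq = reverse_complement(seq)
        yield name, seq


# Toy data, made up for this example
EXAMPLE_GENOME = {"chr1": "CCATGAAACCCGGGTTTTAGCC"}
EXAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "\t".join(["chr1", ".", "CDS", "3", "8", ".", "+", "0", "ID=cds1;Parent=mrna1"]),
    "\t".join(["chr1", ".", "CDS", "12", "17", ".", "+", "0", "ID=cds1;Parent=mrna1"]),
    "\t".join(["chr1", ".", "CDS", "17", "22", ".", "-", "0", "ID=cds2;Parent=mrna2"]),
])

for name, seq in cds_sequences(EXAMPLE_GFF.splitlines(), EXAMPLE_GENOME):
    print(">%s\n%s" % (name, seq))
```

For a real file you would load the genome into a dict first (for example with Biopython's SeqIO) and pass an open file handle as gff_lines.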


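The gffread utility from the Cufflinks package may be useful here. To generate a multi-FASTA file of nucleotide sequences from your GFF file, try the command below. Note that -w writes the spliced exon sequence of each transcript; if you want the coding sequence only, gffread's -x option writes the spliced CDS instead.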

gffread -w output_transcripts.fasta -g reference_genome.fa input_transcripts.gff

 

The xtractore program from the AEGeAn Toolkit was designed for this type of use case. Just set --type=CDS.

$ xtractore -h

xtractore: extract sequences corresponding to annotated features from the
           given sequence file

Usage: xtractore [options] features.gff3 sequences.fasta
  Options:
    -d|--debug            print debugging output
    -h|--help             print this help message and exit
    -i|--idfile: FILE     file containing a list of feature IDs (1 per line
                          with no spaces); if provided, only features with
                          IDs in this file will be extracted
    -o|--outfile: FILE    file to which output sequences will be written;
                          default is terminal (stdout)
    -t|--type: STRING     feature type to extract; can be used multiple
                          times to extract features of multiple types
    -v|--version          print version number and exit
    -V|--verbose          print verbose warning and error messages
    -w|--width: INT       width of each line of sequence in the Fasta
                          output; default is 80; set to 0 for no
                          formatting
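Putting the help text together with --type=CDS, a full invocation can be assembled from Python roughly as follows. This is only a sketch: the file names are placeholders, the helper names are made up for this example, and the --outfile=FILE spelling is an assumption based on the option list above, so check it against your installed version.

```python
import shlex
import shutil
import subprocess


def xtractore_command(gff3, fasta, out, feature_type="CDS"):
    """Build the argument list for an xtractore run (hypothetical helper)."""
    return [
        "xtractore",
        "--type=" + feature_type,   # feature type to extract, as in the answer
        "--outfile=" + out,         # assumed spelling of the -o/--outfile option
        gff3,                       # positional: features.gff3
        fasta,                      # positional: sequences.fasta
    ]


def run_xtractore(gff3, fasta, out, feature_type="CDS"):
    """Run xtractore if it is installed; raise if it is missing or fails."""
    if shutil.which("xtractore") is None:
        raise FileNotFoundError("xtractore not found on PATH")
    subprocess.run(xtractore_command(gff3, fasta, out, feature_type), check=True)


# Only print the command here; call run_xtractore() to actually execute it
cmd = xtractore_command("features.gff3", "sequences.fasta", "cds.fasta")
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.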
