General feature format

Filename extensions	.gff, .gff3
Internet media type	text/gff3
Developed by	Sanger Centre (v2), Sequence Ontology Project (v3)
Type of format	Bioinformatics
Extended from	Tab-separated values
Open format?	yes
Website	github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

In bioinformatics, the general feature format (gene-finding format, generic feature format, GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences.

GFF Versions

The following versions of GFF exist:

- General Feature Format Version 2, generally deprecated
  - Gene Transfer Format 2.2, a derivative used by Ensembl
- Generic Feature Format Version 3
  - Genome Variation Format, with additional pragmas and attributes for sequence_alteration features

GFF2 and GTF have a number of deficiencies, notably that they can represent only two-level feature hierarchies and thus cannot handle the three-level hierarchy of gene → transcript → exon. GFF3 addresses this and other deficiencies: for example, it supports arbitrarily many hierarchical levels and gives specific meanings to certain tags in the attributes field.
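In GFF3 the hierarchy is expressed through the reserved `ID` and `Parent` attribute tags. The following is a minimal sketch of how a gene → transcript → exon hierarchy could be reconstructed from those tags; the feature identifiers and the helper names `build_children` and `depth` are invented for illustration and are not part of any specification or library.

```python
from collections import defaultdict

# Illustrative records as (ID, type, Parent) tuples. The identifiers are made up.
FEATURES = [
    ("gene1", "gene", None),
    ("mRNA1", "mRNA", "gene1"),
    ("exon1", "exon", "mRNA1"),
    ("exon2", "exon", "mRNA1"),
]


def build_children(features):
    """Map each parent ID to the list of IDs of its direct children."""
    children = defaultdict(list)
    for feature_id, _type, parent in features:
        if parent is not None:
            children[parent].append(feature_id)
    return children


def depth(feature_id, parents):
    """Count how many Parent links separate a feature from its root."""
    levels = 0
    while parents.get(feature_id) is not None:
        feature_id = parents[feature_id]
        levels += 1
    return levels


children = build_children(FEATURES)
parents = {fid: parent for fid, _type, parent in FEATURES}
```

Because each feature only names its own parent, nothing in this scheme limits the number of levels. This sketch assumes a single parent per feature, although GFF3 allows several.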

GTF is identical to GFF version 2.^[1]

GFF general structure

All GFF formats (GFF2, GFF3 and GTF) are tab-delimited, with nine fields per line. They share the same structure for the first seven fields and differ in the content and format of the ninth field. Some field names were changed in GFF3 to avoid confusion: for example, the "seqid" field was formerly called "sequence", which could be confused with a nucleotide or amino acid chain. The general structure is as follows:

General GFF3 structure
Position index	Position name	Description
1	seqid	The name of the sequence where the feature is located.
2	source	The algorithm or procedure that generated the feature. This is typically the name of a software or database.
3	type	The feature type name, like "gene" or "exon". In a well-structured GFF file, all child features follow their parents in a single block (so all exons of a transcript are placed after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project.
4	start	Genomic start of the feature, using 1-based inclusive coordinates. This contrasts with 0-based half-open formats such as BED, where the same start would be one lower.
5	end	Genomic end of the feature, using 1-based inclusive coordinates. Numerically, this is the same value as the end coordinate in 0-based half-open formats such as BED.
6	score	Numeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value.
7	strand	Single character that indicates the strand of the feature. This can be "+" (positive, or 5'->3'), "-" (negative, or 3'->5'), "." (undetermined), or "?" for features with relevant but unknown strands.
8	phase	Phase of CDS features; it is one of 0, 1 or 2 for CDS features, or "." for everything else. See the section below for a detailed explanation.
9	attributes	A list of tag-value pairs separated by a semicolon with additional information about the feature.
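
The nine-column layout above can be read with a few lines of code. The following is a minimal sketch, assuming GFF3-style `tag=value` attributes with percent-encoded values; the names `Feature`, `parse_attributes`, `parse_line` and `to_bed_interval`, as well as the example line, are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    score: Optional[float]  # None when the file contains "."
    strand: str
    phase: Optional[int]    # None when the file contains "."
    attributes: dict


def parse_attributes(text):
    """Split GFF3-style 'tag=value;tag=value' text into a dict.

    Comma-separated multi-values are kept as one string in this sketch.
    """
    attributes = {}
    if text == ".":
        return attributes
    for pair in text.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attributes[unquote(tag)] = unquote(value)
    return attributes


def parse_line(line):
    """Parse one tab-delimited feature line into a Feature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 fields, found {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    return Feature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        parse_attributes(attrs),
    )


def to_bed_interval(feature):
    """Convert 1-based inclusive coordinates to 0-based half-open ones."""
    return feature.start - 1, feature.end


# Made-up example line; not taken from any real annotation.
EXAMPLE = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=demo%20gene"
feature = parse_line(EXAMPLE)
```

Note that only the start changes when converting to a BED-style interval; the end value is carried over unchanged.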

The 8th field: phase of CDS features

Simply put, CDS means "CoDing Sequence". The exact meaning of the term is defined by Sequence Ontology (SO). According to the GFF3 specification:^[2]^[3]

For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon.
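
As a worked illustration of this definition, the sketch below trims a CDS segment to its first complete codon and derives the phase of each segment from the segment lengths. It assumes the segments are listed in translation order (5' to 3' along the coding strand) and that the first segment starts in phase 0; the function names are hypothetical.

```python
def trim_to_codon_start(sequence, phase):
    """Remove `phase` bases from the start so the result begins on a codon."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    return sequence[phase:]


def cds_phases(lengths, first_phase=0):
    """Compute the phase of each CDS segment from the segment lengths.

    `lengths` must be given in translation order. After removing `phase`
    bases, a segment contributes (length - phase) in-frame bases; any
    remainder modulo 3 is an incomplete codon that the next segment
    has to finish.
    """
    phases = []
    phase = first_phase
    for length in lengths:
        phases.append(phase)
        leftover = (length - phase) % 3   # bases of an unfinished codon
        phase = (3 - leftover) % 3        # bases the next segment must skip
    return phases
```

For example, segments of lengths 100, 50 and 30 give `cds_phases([100, 50, 30]) == [0, 2, 0]`: the first segment ends one base into a codon, so the second segment must skip two bases to reach its next codon start.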

Meta Directives

In GFF files, additional meta information can be included on lines that begin with the ## directive marker. This meta information can specify the GFF version, sequence region, or species (the full list of directive types can be found in the Sequence Ontology specification).
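
A minimal sketch of how such directive lines might be collected follows. It assumes that a directive name is separated from its value by whitespace, as in `##gff-version 3`, and it skips the bare `###` line that GFF3 also defines; the function name and the sample lines are invented for illustration.

```python
def read_directives(lines):
    """Return (name, value) pairs for every '##' directive line."""
    directives = []
    for line in lines:
        line = line.rstrip("\n")
        # Single-'#' lines are plain comments; '###' is handled elsewhere.
        if not line.startswith("##") or line.startswith("###"):
            continue
        parts = line[2:].split(maxsplit=1)
        if not parts:
            continue
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        directives.append((name, value))
    return directives


# Made-up header lines for demonstration.
SAMPLE = [
    "##gff-version 3",
    "##sequence-region chr1 1 5000",
    "# a free-text comment",
    "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1",
]
found = read_directives(SAMPLE)
```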

GFF software

Servers

Servers that generate this format:

Server	Example file
UniProt	[1]

Clients

Clients that use this format:

Name	Description	Links
GBrowse	GMOD genome viewer	GBrowse
IGB	Integrated Genome Browser	Integrated Genome Browser
Jalview	A multiple sequence alignment editor & viewer	Jalview
STRAP	Underlining sequence features in multiple alignments. Example output: [2]	[3]
JBrowse	A fast, embeddable genome browser built completely with JavaScript and HTML5	JBrowse.org
ZENBU	A collaborative, omics data integration and interactive visualization system	[4]

Validation

The modENCODE project hosts an online GFF3 validation tool with generous limits of 286.10 MB and 15 million lines.

The Genome Tools software collection contains a gff3validator tool that can be used offline to validate and possibly tidy GFF3 files. An online validation service is also available.
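
For a quick sanity check before running a full validator, a few of the constraints from the field table can be tested directly. The sketch below is only an illustration under that assumption and is not a substitute for the tools above; `check_line` is a hypothetical helper, not part of GenomeTools or any other package.

```python
VALID_STRANDS = {"+", "-", ".", "?"}
VALID_PHASES = {"0", "1", "2", "."}


def check_line(line):
    """Return a list of problems found in one feature line (empty if none)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-separated fields, found {len(cols)}"]
    problems = []
    ftype, start, end = cols[2], cols[3], cols[4]
    strand, phase = cols[6], cols[7]
    if not (start.isdecimal() and end.isdecimal()):
        problems.append("start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("coordinates must satisfy 1 <= start <= end")
    if strand not in VALID_STRANDS:
        problems.append(f"unknown strand {strand!r}")
    if phase not in VALID_PHASES:
        problems.append(f"unknown phase {phase!r}")
    elif ftype == "CDS" and phase == ".":
        problems.append("CDS features require a phase of 0, 1 or 2")
    return problems
```

A dedicated validator additionally checks aspects this sketch ignores, such as Sequence Ontology terms and parent-child relationships.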

References

[1] "GFF/GTF File Format". Ensembl. Archived from the original on 2022-06-15. Retrieved 2023-11-04.

[2] "GFF3 specification". GitHub. 2018-11-24. Archived from the original on 2023-07-04.

[3] "GFF3". GMOD. 2016-07-12. Archived from the original on 2023-08-25.
