TreeFam FAQ
Tree families database The Sanger Institute Beijing Genomics Institute

General FAQ

Users' Guide

Developers' FAQ

Curators' FAQ

What is TreeFam?

TreeFam (Tree Families database) is a database of phylogenetic trees of gene families found in animals. It aims to develop a curated resource that presents the accurate evolutionary history of all animal gene families, as well as reliable ortholog and paralog assignments.

Curated families are being added progressively, based on seed alignments and trees in a similar fashion to Pfam. That is, like Pfam, TreeFam is a two-part database: a first part consisting of automatically generated trees (TreeFam-B) and a second part that consists of manually curated trees (TreeFam-A).

How are gene families defined?

TreeFam aims to define a gene family as a group of genes that descended from a single gene in the last common ancestor of all animals, or that first appeared within the animals. We identify the genes in one family on the basis that either (A) they are phylogenetically separated from other genes by a non-animal outgroup gene from either yeast or a plant, or (B) they lack homologs outside the animals.

I see TF5xxxxx families are "created by hcluster". What is the algorithm behind it?

TreeFam clusters (TF5xxxxx series) are created by hcluster_sg, a hierarchical clustering program for sparse graphs. Basically, hcluster_sg performs hierarchical clustering under mean distance. It reads an input file that describes the similarity between pairs of sequences, and groups the two nearest nodes at each step. When two nodes are joined, the distances between the joined node and all the other nodes are updated using the mean distance. This procedure is iterated until one of the three rules is met:

In theory, we could do an exact hierarchical clustering with these three stopping rules. However, as we are always working with about half a million proteins, keeping such a large relational matrix is not practical. Instead, hcluster_sg stores the relationships as a sparse graph: an edge is added only if the similarity is strong enough (E-value < 0.001). Hcluster_sg also introduces an additional edge-breaking rule:

This heuristic rule removes weak relations that are unlikely to be joined at a later step, which improves both time and space efficiency. Hcluster_sg is fast and has a small memory footprint: it can cluster all genes of 29 sequenced species in a few hours with 3 GB of memory.
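The idea of mean-linkage clustering on a sparse graph can be illustrated with a small Python sketch. This is not the hcluster_sg implementation: it works on similarities rather than distances, treats missing edges as similarity 0, and stops on a single hypothetical threshold (`min_mean_sim`) instead of the three stopping rules and the edge-breaking rule, which are not reproduced here. The function name and the input format are assumptions made for illustration.

```python
def mean_linkage_clusters(edges, min_mean_sim):
    """Greedy mean-linkage clustering on a sparse similarity graph.

    edges: dict mapping (a, b) -> similarity (higher means closer).
    Pairs without an edge count as similarity 0 (an assumption).
    min_mean_sim: hypothetical stopping threshold for this sketch only.
    """
    clusters = {}                      # cluster id -> set of members
    for a, b in edges:
        for n in (a, b):
            clusters.setdefault(n, {n})
    owner = {n: n for n in clusters}   # member -> id of its cluster

    while True:
        # Sum the edge weights that connect each pair of current clusters.
        totals = {}
        for (a, b), sim in edges.items():
            ca, cb = owner[a], owner[b]
            if ca != cb:
                key = (ca, cb) if ca < cb else (cb, ca)
                totals[key] = totals.get(key, 0.0) + sim

        # Mean similarity divides by all member pairs, so absent edges count as 0.
        best, best_mean = None, None
        for (ca, cb), total in totals.items():
            mean = total / (len(clusters[ca]) * len(clusters[cb]))
            if mean >= min_mean_sim and (best is None or mean > best_mean):
                best, best_mean = (ca, cb), mean

        if best is None:               # nothing left that is similar enough
            break
        ca, cb = best
        clusters[ca] |= clusters.pop(cb)
        for n in clusters[ca]:
            owner[n] = ca

    return sorted(sorted(c) for c in clusters.values())


example_edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("d", "e"): 0.7, ("c", "d"): 0.1}
print(mean_linkage_clusters(example_edges, min_mean_sim=0.3))
```

Because only existing edges are stored and scanned, memory grows with the number of edges rather than with the square of the number of sequences, which is the point of using a sparse graph.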

Hcluster_sg is freely available and can be checked out with:

How can TreeFam be searched?

TreeFam allows users to easily search for their genes of interest. First, one can search for accession numbers from the source sequence databases such as Ensembl or WormBase. In addition, TreeFam extracts cross-references to GenBank from Ensembl, so it is possible to search for genes using their GenBank accessions as queries. It is also possible to use text searches to search for a gene name (such as leucyl-tRNA synthetase); a gene symbol (such as LARS) or its synonyms (such as LeuRS); as well as to search the TreeFam functional descriptions of curated families.

In general, the TreeFam search page finds records that contain all your query words. The search is limited by the MySQL FULLTEXT search engine: only words of four or more characters are indexed, and a word is found only if it begins with your query. For example, a query `p53' will not match the word `p53', because that word has only three characters; the query `dc' matches `DCTN4' but not `CDC14'.

You can use double quotes to specify a phrase. You can also include the species name to specify the species. For example, `"cyclin A1" human mouse' will find human/mouse genes whose descriptions contain the phrase "cyclin A1". A single letter `A' or `B' (not in quotation marks) specifies whether TreeFam-A or TreeFam-B should be searched. A leading plus `+' indicates that the exact word must be present in every result returned. Without `+', words that begin with your query will be matched. For example, a query `+CCNA1 A' or `"CCNA1" A' will search TreeFam-A for any records containing the exact word "CCNA1".
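The query rules above can be summarised in a short Python sketch. It is only an illustration of the rules as described in this FAQ, not the code used by the TreeFam search page; the function names, the returned dictionary layout and the tiny species list are assumptions.

```python
import re

MIN_INDEXED_LEN = 4                  # shorter words are not indexed (see above)
KNOWN_SPECIES = {"human", "mouse"}   # hypothetical subset, for illustration only


def parse_query(query):
    """Split a query into phrases, exact words, prefix words, species and A/B."""
    parsed = {"phrases": [], "exact": [], "prefixes": [],
              "species": [], "database": None}
    # Each match is either a double-quoted phrase or a whitespace-free word.
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            parsed["phrases"].append(phrase)
        elif word in ("A", "B"):               # unquoted single letter
            parsed["database"] = word
        elif word.lower() in KNOWN_SPECIES:
            parsed["species"].append(word.lower())
        elif word.startswith("+") and len(word) > 1:
            parsed["exact"].append(word[1:])   # leading plus: exact word
        else:
            parsed["prefixes"].append(word)    # default: prefix match
    return parsed


def word_matches(query_word, record_word, exact=False):
    """Apply the minimum-length and prefix rules to a single record word."""
    if len(record_word) < MIN_INDEXED_LEN:
        return False
    q, r = query_word.lower(), record_word.lower()
    return r == q if exact else r.startswith(q)


print(parse_query('"cyclin A1" human mouse'))
print(parse_query("+CCNA1 A"))
print(word_matches("dc", "DCTN4"), word_matches("dc", "CDC14"))
```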

To search for a family, you should search for a gene in that family. You can find the gene matching your query and then identify the family that contains that gene.

If you only have accessions from NCBI, PDB or other databases, you may enter your queries in the "search for external accessions" box. TreeFam records the cross-references of Ensembl genes. According to Ensembl, the following 87 databases have been cross-referenced: wormbase_gene, Uniprot/SPTREMBL, protein_id, EMBL, UniGene, RefSeq_peptide, Uniprot/SWISSPROT, EntrezGene, Uniprot/Varsplic, PDB, wormpep_id, wormbase_locus, GO, wormbase_transcript, AFFY_C_elegans, Genoscope_annotated_gene, Genoscope_pred_gene, Genoscope_pred_transcript, Celera_Trans, EMBL_predicted, Uniprot/SPTREMBL_predicted, protein_id_predicted, Celera_Pep, Celera_Gene, Anopheles_symbol, Ens_Hs_translation, Ens_Hs_transcript, HUGO, RefSeq_peptide_predicted, RefSeq_dna, AFFY_Chicken, miRNA_Registry, ZFIN_ID, AFFY_Zebrafish, ZFIN_xpat, IPI, RefSeq_dna_predicted, cint_jgi_v2, cint_jgi_v1, cint_aniseed_v1, MIM, AFFY_Mouse430A_2, AFFY_Mouse430_2, AFFY_MG_U74Av2, AFFY_MG_U74Cv2, AFFY_MG_U74Bv2, AgilentProbe, MarkerSymbol, RGD, AFFY_RG_U34A, AFFY_Rat230_2, AFFY_RG_U34C, AFFY_RG_U34B, RFAM, flybase_transcript_id, gadfly_transcript_cgid, AFFY_Drosophila_2, AFFY_DrosGenome1, FlyBaseName_transcript, flybase_annotation_id, flybase_gene_id, FlyBaseName_gene, gadfly_gene_cgid, flybase_polypeptide_id, gadfly_translation_cgid, FlyBaseName_translations, Tiffin, MEDLINE, AFFY_Canine, SGD, Xenopus_Jamboree, AgilentCGH, AFFY_U133_X3P, AFFY_HG_U133B, AFFY_HG_U133_PLUS_2, AFFY_HG_U95E, AFFY_HG_U95Av2, AFFY_HG_U95B, AFFY_HG_Focus, AFFY_HG_U133A_2, AFFY_HG_U133A, AFFY_HG_U95D, AFFY_HG_U95C, OTTT, AFFY_HuGeneFL, AFFY_HC_G110 and CCDS.

Link to TreeFam

TreeFam encourages other websites to add links to TreeFam. This can be done by passing your gene accessions to the TreeFam CGIs. Note that TreeFam only records accessions from WormBase, FlyBase, SGD, TIGR, GeneDB, and Ensembl, the sources from which TreeFam sequences were acquired. Here are some examples:

Various pages can now also be accessed by providing external accessions. TreeFam queries the Ensembl MySQL database, fetches the matching Ensembl accessions and redirects to the related TreeFam pages. Note that one external accession may sometimes correspond to several Ensembl accessions; in that case only the first match is used. To use this function, provide the parameter extac to specify the external gene accession and spec to indicate the species.
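A link that uses the extac and spec parameters could be assembled as in the following Python sketch. Only the two parameter names come from this FAQ. The CGI address, the accession and the species value shown are placeholders, and the exact form that TreeFam expects for the species value is not specified here.

```python
from urllib.parse import urlencode


def external_link(cgi_url, accession, species):
    """Append extac (external accession) and spec (species) to a CGI address."""
    return f"{cgi_url}?{urlencode({'extac': accession, 'spec': species})}"


# The address below is a placeholder, not a real TreeFam CGI.
print(external_link("https://example.org/cgi-bin/page", "NM_000546", "Homo sapiens"))
```

`urlencode` escapes spaces and other reserved characters, so accessions and species names can be passed as plain strings.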

How are orthologs inferred in TreeFam?

In TreeFam, orthologs and within-species paralogs are inferred from phylogenetic trees. Two genes are said to be orthologs if they are both present in a gene tree and their last common ancestor (LCA) in the gene tree is a speciation node. Within-species paralogs are more obvious from trees, and we further provide the ancestral species in which the paralogs originated.
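The ortholog definition above can be written down as a minimal Python sketch. It assumes that every internal node of the gene tree has already been labelled as a speciation or a duplication; the `Node` class, the event labels and the gene names are hypothetical and are not TreeFam's data model.

```python
class Node:
    """Gene tree node; leaves carry a gene name, internal nodes an event."""

    def __init__(self, name=None, event=None, children=()):
        self.name = name            # gene name (leaves only)
        self.event = event          # "speciation" or "duplication" (internal only)
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def ancestors(node):
    """Return the node itself followed by all of its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def lca(a, b):
    """Last common ancestor of two nodes in the same tree, or None."""
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):
        if id(n) in seen:
            return n
    return None


def are_orthologs(a, b):
    """Two genes are orthologs if their LCA is a speciation node."""
    node = lca(a, b)
    return node is not None and node.event == "speciation"


# A duplication followed by a speciation in each copy.
human_a, mouse_a = Node("humanA"), Node("mouseA")
human_b, mouse_b = Node("humanB"), Node("mouseB")
root = Node(event="duplication", children=[
    Node(event="speciation", children=[human_a, mouse_a]),
    Node(event="speciation", children=[human_b, mouse_b]),
])

print(are_orthologs(human_a, mouse_a))  # LCA is a speciation node
print(are_orthologs(human_a, mouse_b))  # LCA is the duplication at the root
```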

For a TreeFam-A family, we merge the constrained full tree and the clean tree to make use of curation information; for a TreeFam-B family, we use the clean tree directly. We then use a multifurcated species tree to infer duplication events and orthologs. The species tree used here (below) contains more polytomies (multifurcated nodes) than the tree used to infer the duplications displayed on the TreeFam web site. Polytomies lead to the loss of some information, but they help to infer correct orthologs even if the gene tree contains wrong branches.
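To show why polytomies in the species tree make the inference more tolerant, here is a simplified Python sketch of a duplication test. It is a stand-in based on species overlap below the mapped species-tree node, not the algorithm TreeFam actually uses. The species tree (`PARENT`), the function names and the nested-tuple gene tree format are all assumptions for illustration.

```python
# Hypothetical species tree as a child -> parent map; "mammals" is a polytomy.
PARENT = {"human": "mammals", "mouse": "mammals", "dog": "mammals",
          "mammals": "root", "fly": "root"}


def leaf_species(tree):
    """Gene tree as nested tuples; each leaf is the species name of a gene."""
    if isinstance(tree, str):
        return [tree]
    return [s for child in tree for s in leaf_species(child)]


def path_to_root(node, parent):
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def species_lca(species, parent):
    """Lowest species-tree node that is an ancestor of all given species."""
    paths = [path_to_root(s, parent) for s in species]
    common = set(paths[0]).intersection(*paths[1:])
    return next(n for n in paths[0] if n in common)


def branch_below(ancestor, leaf, parent):
    """Child of `ancestor` on the way down to `leaf` (or `ancestor` itself)."""
    path = path_to_root(leaf, parent)
    i = path.index(ancestor)
    return path[i - 1] if i > 0 else ancestor


def is_duplication(tree, parent):
    """Internal gene tree node is a duplication if two of its children
    reach the same branch below the species-tree node it maps to."""
    anc = species_lca(leaf_species(tree), parent)
    seen = set()
    for child in tree:
        branches = {branch_below(anc, s, parent) for s in leaf_species(child)}
        if branches & seen:
            return True
        seen |= branches
    return False


print(is_duplication((("human", "mouse"), "dog"), PARENT))    # any order fits the polytomy
print(is_duplication((("human", "mouse"), "human"), PARENT))  # human appears on both sides
```

With a fully resolved species tree, a gene tree that groups the species in a different order would force an extra duplication at the root; under the polytomy, the same gene tree is read as a plain speciation.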

The `score' on the TreeFam gene page is the bootstrap support, rescaled around 100. It measures how often the same ortholog (or within-species paralog) pair is inferred from a resampled tree. The higher the score, the better supported the pair.

What is the goal of curation?

The goal of curation is to ensure that the phylogenetic tree for a family accurately reflects the orthology relationships and history of gene duplications and losses in the family.

Orthology, gene duplication and loss make sense in the context of a phylogenetic tree. However, automatic trees are often incorrect, either because of poor data quality or because the tree reconstruction algorithm assumes an unrealistic model of evolution. There is currently no tree reconstruction algorithm that can solve these difficulties. We believe that orthology and paralogy statements must be consistent with the tree we present. Therefore, to improve the accuracy with which TreeFam reflects the orthology relationships and history of gene duplications and losses in a family, our approach is that human experts manually curate the automatic trees. The curators only edit a tree if additional phylogenetic analyses and information such as gene function strongly suggest that the automatic tree is incorrect. We allow multifurcating trees if there is ambiguity.

As well as manually curating the topology of a tree, TreeFam curators assign a name and symbol to each family, and symbols to obvious subfamilies within a family. If possible, the HGNC name and symbol for the human gene in a family/subfamily are used to name that family/subfamily. If a TreeFam-B family contains up to three human genes, a provisional symbol is assigned to the family by concatenating the HGNC symbols of those human genes (for example, HGNC1/HGNC2/HGNC3). On the other hand, if a TreeFam-B family contains more than three human genes, it is assigned the temporary symbol `mixed'.
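The rule for provisional TreeFam-B symbols is simple enough to state as a Python sketch. The function name is hypothetical, and returning None for a family without human genes is an assumption, since that case is not described above.

```python
def provisional_symbol(human_symbols):
    """Provisional symbol for a TreeFam-B family from its human HGNC symbols."""
    if not human_symbols:
        return None                      # assumption: case not covered by the rule
    if len(human_symbols) <= 3:
        return "/".join(human_symbols)   # e.g. HGNC1/HGNC2/HGNC3
    return "mixed"                       # more than three human genes


print(provisional_symbol(["HGNC1", "HGNC2", "HGNC3"]))
print(provisional_symbol(["HGNC1", "HGNC2", "HGNC3", "HGNC4"]))
```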

The curator also writes a short description of the function of the genes in a family, based on a review of the literature.

When is a tree said to be curated?

A very important step in curating the TreeFam tree for a family is annotating the nodes in the tree that are considered to be correct. That is, once the curator has finished editing the phylogenetic tree for a family, the curator marks the nodes in the tree that are considered to be probably correct with `C'. A node is marked with `C' if the curator is sure that (A) the subtree descending from that node contains every gene that it should contain (among the sequences already in the tree); and (B) the subtree does not contain any genes that it should not contain; and (C) the topology of the subtree is completely correct. If the curator has doubts about whether the node is correct, then the node is marked with `P' (putative).

In practice, when we mark nodes as `C' or `P', we only consider the topology with respect to the `core TreeFam species': human, mouse, rat, chicken, Drosophila melanogaster and Caenorhabditis elegans. We do not include the fully sequenced fish genomes (zebrafish and pufferfish) as `core species' because they contain many duplicated genes, and it is often difficult to be sure that these duplicated fish genes are correctly placed in a tree. Similarly, we do not treat the partially sequenced animal genomes whose sequences come from UniProt as `core species', because UniProt sequences are often fragmentary.

Should I remove some sequences? And how?

Our procedure for creating the automatic trees in TreeFam-B aims to ensure that the phylogenetic tree for a family only contains (A) family members that descended from a single gene in the last common ancestor of animals, and (B) the most closely related yeast and/or plant outgroup sequences. For most TreeFam-B families, our automatic procedure for doing this is successful, but it is not yet perfect. Thus, there are some TreeFam-B families that contain genes that descended from more than one (paralogous) gene in the last common ancestor of animals. There are also some genes that appear in more than one TreeFam-B family. However, our goal is that each animal gene should appear in just one TreeFam-A (curated) family. Therefore, when overlapping TreeFam-B families are curated as described above, we manually split them into two or more non-overlapping TreeFam-A families, each with its own tree.

How is the topology of a tree curated?

Manual curation is a key feature of TreeFam. During curation, experts manually correct errors in the automatic trees for TreeFam-B families. To curate a tree, the curator gathers phylogenetic and functional information on the genes in the family from journal articles; from manually curated databases such as UniProt, FlyBase, WormBase and OMIM; and from accepted species taxonomy in the NCBI database.

If the phylogenetic tree for a family differs from that expected from functional information, published articles or the accepted species taxonomy, the curator explores the plausibility of alternative tree topologies using a combination of published and in-house tools. For example, the Jalview alignment editor is used to display and edit alignments; and an extended version of the ATV tree viewer is used to display and edit phylogenetic trees. If a curator suspects that a tree is missing genes, BLAST and HMMER are run with non-stringent E-value cutoffs, to search for distant sequence matches.

The in-house tools developed for TreeFam include: (A) an algorithm that infers the nodes in the tree that correspond to gene duplications or gene losses; (B) an interactive program for tree curation, tctool, that allows the curator to visually adjust the gene tree topology and recalculate a score which reflects both how well the topology explains the sequence alignment and (optionally) how closely the topology agrees with the species tree; and (C) an alignment viewer that displays the positions of intron-exon boundaries with respect to a multiple alignment of the proteins in a family.

Last Modified Tue Mar 6 09:39:24 2007