Gclust databases

Databases of homolog tables of many organisms, automatically constructed but from a biologist's viewpoint.

Developed by Naoki Sato


This page just describes the Gclust software and Gclust databases.
Use Gclust server to see how phylogenetic profiling with pre-calculated clusters is useful in gene finding.


A document in Japanese is also available as a PDF file. Click here to download.

Brief summary of the Gclust software
Gclust software was developed to make clusters of protein sequences from all predicted protein sequences in a selected set of genomes. The clusters are homolog groups, but not ortholog clusters (see below for the distinction), and therefore, contain all homologous sequences encoded by the selected genomes. An ortholog cluster, such as the one in KEGG Orthologs or COG in NCBI, contains only a single sequence for each genome, and such a single representative is usually selected by a criterion called "bi-directional best hit". By contrast, a homolog group contains all reliable homologs, that represents a gene family. However, we need several techniques (see the next section for specialists) to obtain good homolog groups, because a very large group of sequences consisting of unrelated sequences could be formed if similar sequences are simply added to a homolog group. Gclust uses E-value of BLASTP and overlap score (representing the proportion of homologous regions shared by two sequences) as a two-dimensional matrix, to select for the proper E-value and overlap score for each homolog group so that not too many homologs are put into the group. To do so, number of organisms is also considered. Detailed explanation of the algorithm was presented in the GIW2005 paper.

Gclust databases
The Gclust software can be used with any set of genomes. We have been working on photosynthetic organisms, and are interested in finding conserved proteins in prokaryotic and eukaryotic photosynthetic organisms. Therefore, initial datasets included mainly photosynthetic organisms with some non-photosynthetic organisms for comparison. The datasets, CZ16Y, CZ20x, CZ30 and CZ35, are such datasets including different number of genomes. The results of computation are now available for the public through a web interface. The dataset Bact129 includes 132 species of bacteria. The dataset ALL145 includes animals and plants (including algae) as well as many bacteria and Archaea. Organellar genomes are also included, but selection is only possible on organisms (nuclear genome and mitochondrial genome, as well as chloroplast genome, if present). For organellar studies, datasets including all available chloroplast genomes (plus cyanobacterial genomes) or many mitochondrial genomes of photosynthetic organisms are also provided in this web site. They are named CPBACT8x and Mt23, respectively.

Uniqueness of the Gclust software
Many researchers use BLASTP to search homologous sequences in the non-redundant databases. But the results are usually difficult to interpret, because many similar sequences rank high. There are sometimes duplicated entries of an identical sequence. The Gclust databases are pre-calculated similarity matrices, which show all homologs in the selected dataset. Users do not need to perform an iterated BLASTP search.

If you are not satisfied with the provided clusters, ...
it is time to make clusters by yourself. Use the Gclust software to make clusters from a genome set containing your favorite organisms. Currently, detailed usage of the Gclust software as well as its associated software is being prepared.

Use of the Gclust software (for computational biologists)
The Gclust software is written in C, and runs on any common UNIX platforms including Mac OS X. Memory requirement depends on input data and mode of operation. When all the 102,513 predicted proteins encoded in the four eukaryotic (including organelles) and 13 prokaryotic genomes are clustered, about 9 GB memory was used on SGI Onyx3400. The computation of the ALL145 dataset required more than 2 weeks using the supercomputer system in the Human Genome Center at the University of Tokyo.
   The input file is a summary list generated by a perl script, bl2ls3.pl, from the output of all-against-all BLASTP analysis over the entire protein sequences. Very large proteins are excluded from the analysis, because they cause unusual clustering of unrelated proteins. In the current version of Gclust, the BLASTP results as parsed as described above, are processed in the following two steps: first, the data are read and partially transformed in internal data format. In the save mode, these data are saved in a large file 'data.out' for further analysis with varying parameters. A 'tapering' or 'ashikiri' option is provided to remove low homology data, with keeping low homology data for short sequences (from 1e-6 for >100 aa to 1e-3 for <40 aa).. In Gclust, homology data are handled as a chunk called sqlist, holding region to region similarity, namely, coordinates of similarity region in both (query and target) proteins and E-value. Therefore, a combination of two proteins may have many sqlist data, depending on the domain structure and repeat.
   In the second step, Gclust reads the data.out file, and performs clustering according to various options. However, the most useful option is the -clique option, which produces a good clustering result in relatively short time (within one day). In the clique mode, the sqlist data are converted to match data, which hold data of binary (i.e., protein to protein) similarity, namely, best E-value among sqlist, overlap score showing total overlap region devided by total length, and domain structure estimated from homology segments. Normally, clique mode requires org_list data, listing organisms. For each protein, all match data are tabulated in 2D, using E-value and overlap score. Match data are selected one by one starting from the corner with the highest E-value and highest overlap score. Various criteria are applied, but essentially, a clearly defined cluster of match data with respect to E-value and overlap score is selected. In addition, match data are selected to cover as many organisms as possible but without picking up very low similarity data.  After such purification of match data, idlist holding list of IDs of homologs is made for each proein. The threshold E-value and overlap score are also stored. Then, homolog clusters are formed by merging individual idlists. At this stage, idlists with very diffent threshold E-values are not merged. After a repeat of merging and removing, isolated proteins generated by removal step are again incorporated into the most adequate cluster. Homolog groups are sorted according to the number of entries. Finally, homolog groups are printed out to a large file as a canetated similarity matrix. The matrix may be expressed in 1 (similar) - 0 (dissimilar), E-value, or overlap score, depending on output options, 1, r, or s, respectively.

Using a perl script homologtableG4.pl, the homology matrix can be transformed into a table showing members of each homolog group.

Then, tbsort2 software (written in C) is used to select homolog groups that are conserved in a selected set of organisms. We call this "phylogenetic profiling", which may be useful to extract conserved proteins of unknown function, which might be involved in the pathway or process that are shared by the set of organisms. We apply this principle to extract "chloroplast proteins of endosymbiont origin" or CPRENDOs. But other usage of the phylogenetic profiling might be possible.

-- additional old explanation --

In the basic mode with the -hom option, single-linkage clustering is performed with an E-valueas a threshold. In this case, all the homologues that are linked by a single homology relationship are placed in a single cluster. Such clusters are used as discrete characters to make a parsimony tree (using the PAUP software) that we call 'genome tree'. With -repeat option, an iterated clustering is performed by changing the threshold E-value from the lowest (such as 10-50) to the highest (such as 10-3). During the iteration, an abrupt increase in the number of members of a cluster is taken as a sign of formation of unnatural cluster including distantly related or multidomain proteins. An additional criterion is the overlap score, which is defined as the sum of length of homology regions over the entire sequences divided by the sum of the lengths of the two sequences to be compared. Another criterion is the complexity of domain structure, which is estimated based on the BLASTP data and which is used to eliminate multidomain proteins during the initial iteration. By these criteria, concise or natural clusters are extracted and removed from further clustering with higher E-values. In an additional mode with -homsub option, the final clusters are further sub-clustered to maximize similarity within each subgroup.


Example homolog group

Image file i

Example alignment


Alignment image is about 250 kb.


References

N. Sato, M. Ishikawa, M. Fujiwara and K. Sonoike (2005)
Mass identification of chloroplast proteins of endosymbiont origin by phylogenetic profiling based on organism-optimized homologous protein groups.
Genome Informatics 16: 56-68.

N. Sato (2003)
Gclust: genome-wide clustering of protein sequences for identification of photosynthesis-related genes resulting from massive horizontal gene transfer.
Genome Informatics 14: 585-586.

N. Sato (2002)
Comparative analysis of the genomes of cyanobacteria and plants.
Genome Informatics 13: 173-182.



Distribution of data and software is now possible from the gclust server.


Last update: June 15, 2006.

Back to Sato Laboratory Home Page