The NCI caBIG™ project is creating a common, extensible informatics platform that
integrates diverse data types and supports interoperable analytic tools. This
platform will allow research groups to tap into the rich collection of emerging
cancer research data while supporting their individual investigations.
However, many software applications use non-overlapping sets of genomic
identifiers in their object models and therefore cannot interoperate directly. GeneConnect
is a caBIG™ mapping service that makes this interoperability possible by
interlinking approved genomic identifiers. These include:
- Ensembl Gene ID
- Ensembl Transcript ID
- Ensembl Protein ID
- Entrez Gene ID
- UniGene ID
- GenBank mRNA Accession Number
- GenBank Protein Accession Number
- RefSeq mRNA Accession Number
- RefSeq Protein Accession Number
- UniProtKB Primary Accession Number
To interlink these identifiers, pairwise connections were constructed from database
annotations (either direct or inferred) and an alignment engine. All-to-all relationships
were then calculated by traversing every possible combination of edges in the graph
(see Figure), using each node in turn as the starting point. For each query, which
consists of one or more input identifiers and a set of paths that may be traversed,
a Path Score and a Frequency are calculated. These are defined as:
- Path Score: calculated for each set of genomic identifiers in the result set.
The Path Score is the frequency with which a given set of genomic identifiers
was obtained across all traversed paths, given the query criteria (the input
identifiers and the set of paths that may be traversed).
- Frequency: calculated for each genomic identifier in the result set.
The Frequency denotes how often a given genomic identifier was obtained from a given data
source across all traversed paths, given the same query criteria.
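
The two definitions above can be illustrated with a minimal Python sketch. It is not the GeneConnect implementation. The identifiers, the `LINKS` table, and the helper names `traverse` and `score_query` are all hypothetical. The sketch also makes two assumptions that the description does not spell out: a result "set" is modelled as an (input identifier, output identifier) pair, and both measures are normalized by the number of traversed paths.

```python
from collections import Counter

# Hypothetical pairwise links: (source A, source B) -> {id in A: {ids in B}}.
# All identifiers below are made up for illustration.
LINKS = {
    ("Ensembl Gene", "Entrez Gene"): {"ENSG_A": {"EG_1"}},
    ("Entrez Gene", "UniProtKB"): {"EG_1": {"P_X"}},
    ("Ensembl Gene", "Ensembl Protein"): {"ENSG_A": {"ENSP_A"}},
    ("Ensembl Protein", "UniProtKB"): {"ENSP_A": {"P_X", "P_Y"}},
}


def traverse(start_id, path):
    """Follow one path (a sequence of data sources) from start_id.

    Returns a list holding one set of identifiers per data source on the path.
    """
    levels = [{start_id}]
    for src, dst in zip(path, path[1:]):
        mapping = LINKS.get((src, dst), {})
        reached = set()
        for identifier in levels[-1]:
            reached |= mapping.get(identifier, set())
        levels.append(reached)
    return levels


def score_query(start_id, paths):
    """Compute toy Path Scores and Frequencies over the given paths."""
    set_hits = Counter()  # (input id, output id) -> number of paths yielding it
    id_hits = Counter()   # (data source, identifier) -> number of paths yielding it
    for path in paths:
        levels = traverse(start_id, path)
        for output_id in levels[-1]:
            set_hits[(start_id, output_id)] += 1
        for source, ids in zip(path[1:], levels[1:]):
            for identifier in ids:
                id_hits[(source, identifier)] += 1
    total = len(paths)  # assumed normalization: fraction of traversed paths
    path_score = {key: n / total for key, n in set_hits.items()}
    frequency = {key: n / total for key, n in id_hits.items()}
    return path_score, frequency


paths = [
    ("Ensembl Gene", "Entrez Gene", "UniProtKB"),
    ("Ensembl Gene", "Ensembl Protein", "UniProtKB"),
]
path_score, frequency = score_query("ENSG_A", paths)
```

In this toy data, `P_X` is reached through both paths and gets a Path Score of 1.0, while `P_Y` is reached through only one path and gets 0.5. An identifier confirmed by more independent routes therefore ranks higher, which is the intuition behind both measures.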

GeneConnect Build Information
| Number of pairwise links | 42 |
| Number of distinct genomic identifier sets | 22162231 |
| Number of possible paths through the GeneConnect graph | 4106 |

Database Version Information
| Ensembl | Version 40 |
| UniGene | Homo sapiens Build #194 (26-July-2006) |
| Entrez Gene | Homo sapiens Build (1-August-2005) |
| GenBank Nucleotide | Data currently not available |
| GenBank Protein | Data currently not available |
| UniProtKB | Version 8.0 Release (30-May-2006) |
| RefSeq | Release 18 |