
  1. About GeneSigDB
  2. Process: Collection of gene lists
  3. Process: GeneSigDB file structure
  4. Process: Mapping of gene lists to EnsEMBL genes
  5. Process: Mapping of gene lists - Special Cases
  6. Web interface to GeneSigDB
  7. Querying GeneSigDB
  8. Querying GeneSigDB: Publication Search
  9. Querying GeneSigDB: Publication Search Results
  10. Querying GeneSigDB: Gene Search
  11. Querying GeneSigDB: Wildcard Search
  12. Querying GeneSigDB: Advanced Gene Search
  13. Querying GeneSigDB: Gene Search Results
  14. Gene Signature Comparison
  15. Gene Signature View
  16. Downloading Gene Signatures
  17. Programmatic Access to GeneSigDB

1. About GeneSigDB

Gene expression studies typically result in a list of genes (a gene signature) that reflects the many biological pathways that are concurrently active. We have created a Gene Signature Data Base (GeneSigDB) of published gene expression signatures, or gene sets, which we manually extracted from the published literature.

2. Process: Collection of gene lists

GeneSigDB was created following a thorough search of PubMed using a defined set of cancer gene signature search terms. For more information on the content of this version of GeneSigDB, see our database statistics page.

Table 1. Search used to generate list of published articles which might contain gene signatures

XXX
AND ("genechip" OR microarray OR "gene expression")
AND ("gene signature" OR "genelist" OR "expression profile" OR "Classifier" OR "Predictor")
AND English [la]
NOT Review [pt]

where XXX are the tissue or subtype specific search terms provided in Table 2.
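As an illustration only, the search in Table 1 could be assembled programmatically as sketched below. The function name and the example tissue term are hypothetical (Table 2 is not reproduced here), and wrapping the tissue terms in parentheses is an assumption about how XXX is combined with the fixed clauses.

```python
# Fixed clauses copied from Table 1.
FIXED_CLAUSES = [
    'AND ("genechip" OR microarray OR "gene expression")',
    'AND ("gene signature" OR "genelist" OR "expression profile" '
    'OR "Classifier" OR "Predictor")',
    "AND English [la]",
    "NOT Review [pt]",
]


def build_pubmed_query(tissue_terms: str) -> str:
    """Prefix the fixed Table 1 clauses with tissue or subtype terms (XXX)."""
    # Parenthesizing the tissue terms is an assumption, made so that any
    # OR inside them does not change the meaning of the following ANDs.
    return " ".join([f"({tissue_terms})"] + FIXED_CLAUSES)


# Hypothetical tissue term, used purely as an example.
query = build_pubmed_query('"breast cancer"')
print(query)
```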

 

3. Process: GeneSigDB file structure

Each manuscript was downloaded and manually curated to extract all gene signatures described in the publication. Gene Signatures were extracted from:

Gene signature tables were extracted, and information about the source and contents of each gene signature (Table 3) was captured in an Excel spreadsheet template designed to hold gene signatures and their associated annotation. Each gene signature was given a unique identifier, SigID, in the format PMID-X, where PMID is the PubMed ID of the publication and X is a table, figure or supplementary table number (e.g. 18593951-Table1, 16964388-SuppTable1).

Gene signature tables were converted to tab-delimited format and saved as SigID.txt (e.g. 18593951-Table1.txt, 16964388-SuppTable1.txt). All files related to a publication, including all signatures extracted from it, are stored in a single folder named PMID, where PMID is the PubMed identifier for that article (Figure 1). Metadata associated with each gene signature were extracted from the Excel file and stored as an XML file, SigID-index.xml, the elements of which are listed in Table 3. The XSD schema is available.

Table 3. Signature Annotation File
Item                  Description
SigID                 PMID-X, where X is the table or figure number
SigName               Signature name, generally in the format Tissue_AuthorYearofPublication_NumberOfGenes
PMID                  PubMed identifier of the publication
Description           The figure legend or table description
URL                   URL from which the signature was downloaded or obtained
Platform              The technical platform (microarray) used to generate the signature
Platform description  A description of the platform
Comments              Comments by the authors or by ourselves

All gene identifiers in a gene signature file (SigID.txt) are extracted and stored in a mapping file (SigID-mapping.txt). These identifiers are mapped to EnsEMBL genes using BioMart, and the resulting EnsEMBL gene identifiers are stored in SigID-standardized.txt. The BioMart query history showing how each gene was mapped is stored in SigID-maptrace.txt; this history is recorded whether or not the query was successful. This process is summarized below.

Overview of GeneSigDB curation process

GeneSigDB File Naming Structure
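The naming scheme described in this section can be sketched in a few lines. This is a minimal illustration, not the curation pipeline itself: the root folder name "genesigdb" and the function name are hypothetical, and only the file suffixes and the PMID folder convention come from the text above.

```python
from pathlib import PurePosixPath

# Suffixes appended to the SigID for each file type named in this section.
SUFFIXES = {
    "signature": ".txt",                   # original tab-delimited gene list
    "index": "-index.xml",                 # signature annotation (Table 3)
    "mapping": "-mapping.txt",             # extracted gene identifiers
    "standardized": "-standardized.txt",   # EnsEMBL gene identifiers
    "maptrace": "-maptrace.txt",           # mapping query history
}


def signature_files(sig_id: str, root: str = "genesigdb") -> dict:
    """Return the expected file paths for one SigID (format PMID-X)."""
    pmid, _, label = sig_id.partition("-")
    if not pmid.isdigit() or not label:
        raise ValueError(f"not a valid SigID: {sig_id!r}")
    folder = PurePosixPath(root) / pmid    # one folder per publication
    return {kind: folder / f"{sig_id}{suffix}" for kind, suffix in SUFFIXES.items()}


files = signature_files("18593951-Table1")
for kind, path in files.items():
    print(kind, path)
```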

4. Process: Mapping of gene lists to EnsEMBL genes

Gene identifiers were extracted from each gene list and mapped to EnsEMBL genes using BioMart. Where multiple identifiers existed within a gene list, we ranked the identifiers that could be mapped (Table 4) and took the first match from the ranked list.

Table 4. Rank order of gene identifiers:
Identifier     Comments
Probe ID       High priority
Clone ID
EnsEMBL ID
EntrezGene ID
RefSeq ID      GenBank and RefSeq are often mixed
GenBank ID     GenBank and RefSeq are often mixed
Protein ID
CCDS ID
UniGene ID
Gene Symbol    Low priority
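The "first match from the ranked list" rule can be sketched as follows. This is an interpretation rather than the actual curation code: the lookup function stands in for a BioMart query, and the toy lookup table and probe name are hypothetical.

```python
# Identifier types in the rank order of Table 4 (highest priority first).
ID_RANK = [
    "Probe ID", "Clone ID", "EnsEMBL ID", "EntrezGene ID", "RefSeq ID",
    "GenBank ID", "Protein ID", "CCDS ID", "UniGene ID", "Gene Symbol",
]


def map_gene(row: dict, lookup):
    """Try each identifier in rank order and return the first successful mapping.

    row    maps identifier type -> value for one gene in a signature.
    lookup is a stand-in for a BioMart query; it returns an EnsEMBL gene ID
           or None when the identifier cannot be mapped.
    """
    for id_type in ID_RANK:
        value = row.get(id_type)
        if not value:
            continue
        ensembl_id = lookup(id_type, value)
        if ensembl_id is not None:
            return id_type, ensembl_id
    return None


# Toy lookup table for illustration only.
TOY_TABLE = {
    ("EntrezGene ID", "672"): "ENSG00000012048",
    ("Gene Symbol", "BRCA1"): "ENSG00000012048",
}


def toy_lookup(id_type, value):
    return TOY_TABLE.get((id_type, value))


row = {"Probe ID": "unmapped_probe", "EntrezGene ID": "672", "Gene Symbol": "BRCA1"}
result = map_gene(row, toy_lookup)
print(result)  # the probe fails to map, so the EntrezGene ID is used
```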

5. Process: Mapping of gene lists - Special Cases

In special cases involving custom arrays designed using sequences not present in the major sequence databases (e.g. affy_hu35ksuba, or the Rosetta Agilent arrays used by van 't Veer et al., 2002), for which we were able to obtain the target sequences, we used the sequences to map the probes directly to EnsEMBL transcripts.

We ran a BLAST sequence similarity search and filtered the results, requiring at least 95% similarity and at least 50% of the target sequence to be covered by the match to an EnsEMBL transcript. In the case of a successful match, we substituted the original ID with the corresponding EnsEMBL gene ID. We used the program megablast from the NCBI tools package with the following set of parameters:

-p94 -W18 -X16 -JF -v9000000 -b9000000 -C40 -D4 -FF -UF
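The post-search filter described above (at least 95% similarity and at least 50% of the target sequence covered) might look like the sketch below. The Hit record, its field names and the example hits are hypothetical; the actual megablast output format and parsing step are not described here, so hits are assumed to be already parsed.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One parsed alignment between a probe target sequence and a transcript."""
    probe_id: str
    transcript_id: str
    percent_identity: float   # similarity of the aligned region, in percent
    align_length: int         # number of target bases covered by the match
    target_length: int        # full length of the probe target sequence


def passes_filter(hit: Hit, min_identity: float = 95.0, min_coverage: float = 0.5) -> bool:
    """Keep hits with enough similarity and enough of the target covered."""
    coverage = hit.align_length / hit.target_length
    return hit.percent_identity >= min_identity and coverage >= min_coverage


# Made-up hits for illustration only.
hits = [
    Hit("probe1", "transcriptA", 98.0, 60, 100),   # passes both thresholds
    Hit("probe2", "transcriptB", 94.0, 100, 100),  # similarity too low
    Hit("probe3", "transcriptC", 99.0, 40, 100),   # coverage too low
]
kept = [h for h in hits if passes_filter(h)]
print([h.probe_id for h in kept])
```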

6. Web interface to GeneSigDB

The web interface to GeneSigDB (http://compbio.dfci.harvard.edu/genesigdb) is based on HTML, CSS, JavaScript, JSP, XML and Java 1.6 technologies. The application runs on an Apache Tomcat 6 web application server running on a CentOS 5 Linux server. Front-end interactivity makes use of the jQuery 1.4+ JavaScript library, and server-side processing is based on the Apache Solr Enterprise Search Server 1.4. GeneSigDB takes a hybrid approach, using both traditional database technologies and Solr indexes to catalog and organize the data. Search functions are performed using the Solr server to create a high-performance search engine. Other site functions are often performed using a database backend. Both the database and the indexes contain the same information, but they are organized in different ways to take advantage of the relative strengths of each technology.

7. Querying GeneSigDB

GeneSigDB offers two search engines: a publication search and a gene search. In the current release of GeneSigDB these searches cannot be combined, so each query is either a Publication Search or a Gene Search.

8. Querying GeneSigDB: Publication Search

The publication search searches the full text of the PDF of each article and the Medical Subject Heading (MeSH) terms associated with articles indexed by PubMed to retrieve a list of publications and the gene signatures associated with them.

To use the Publication Search, enter one or more search terms such as author name, article title, journal name, keywords or MeSH terms. As you type, an auto-suggestion box drops down, offering search terms that are relevant to the publication collection and match what you are typing. Select a term by clicking on it with the mouse, or use the cursor keys to navigate the list and press Return.

After selecting a search term from the drop-down list, you are free to select additional terms. Each term is treated as a query term, and the terms are combined with the AND operator.

9. Querying GeneSigDB: Publication Search Results

The search results are displayed in an interactive table that reports how many results were found and shows the first 10 results by default. The "Filter Results" column on the left displays search facets, which can be used to constrain or further refine the current search. We currently do not provide a means to expand a search.

Search results can also be refined with custom queries by entering text in the Refine Search box.

The "Selected Signatures" section displays the currently selected signatures in the "shopping basket". Initially it shows a basket icon to indicate that the shopping basket is empty. Under the Selected Signatures window there are various actions that can be performed. Probably the most useful is "Add All", which adds the signatures associated with all DISPLAYED search results (by default 10 search hits are shown) to the basket. The basket can be emptied using the "Clear All" button.

To add a selected signature to the basket, click in the Signature Counts column of the Search Results table.
Note that the search does not query the gene signature itself; it queries the publication, which can be associated with one or more signatures.

A relevant search facet can be selected by clicking on the text of the term or on the green + symbol adjacent to it. Each facet shows the number of matching results in parentheses next to it.

When the facet is selected the term will be removed from the facet list and added to the Search Terms section of the Search Results box. The results table will also be updated to reflect the selected facet. The facet can be deselected by clicking on the X beside the term in the Search Terms section.

The results table is interactive. The Title of each publication is a link to a Publication detail page. The Authors cell is clickable and expands to show the complete author list. Most importantly, the Signatures cell expands to show a list of the signatures that have been curated from the publication. Each signature has information regarding Organism, Platform and Tissue.

To select a signature, click on the green + symbol. To view the detail on a particular signature, click on its title and a section will expand. Upon selecting a signature, the signature name is added to the "Selected Signatures" section. Each selected signature name is a link to the Signature Detail page. The signature name has the format Tissue_FirstAuthorYear_NumberOfGenes. The Clear All button clears the section of any selected signatures. Clicking on the Comparison button compares the signatures in a Comparison Matrix and Fisher Exact table. The Download button allows one to download the data set based on the selected gene signatures.

10. Querying GeneSigDB: Gene Search

The Gene Search queries the annotation for all genes that are present in standardized gene lists in GeneSigDB. One can enter either a single gene or multiple genes into the gene search. One can also upload a list of genes using the advanced gene query option. Gene search terms can be gene symbols; EnsEMBL, EntrezGene or Affymetrix identifiers; GO terms; InterPro terms; or pathway annotations.

11. Querying GeneSigDB: Wildcard Search

By default we perform a wildcard search. To retrieve both the BRCA1 and BRCA2 genes, simply type BRCA in the search box. The search will return all genes that begin with BRCA.

12. Querying GeneSigDB: Advanced Gene Search

A gene list (multiple genes) can be entered into the query box on the front page as space- or comma-separated search terms; the default gene search links multiple entries by OR. However, it is probably simpler to use the advanced query page, reached by clicking on the advanced search option. There, one can simply cut and paste a one-column list of gene identifiers. Alternatively, a one-column list of identifiers can be uploaded using the upload-a-file option on the advanced query page. Both options permit the wide range of gene identifiers listed above.
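To make the query semantics concrete, the sketch below splits a space- or comma-separated gene list and combines the terms by OR, treating each term as a prefix as in the default wildcard search. It is an in-memory illustration only and not how the Solr-backed search is implemented; the function names, the tiny gene catalog and the case-insensitive matching are assumptions.

```python
import re


def parse_gene_terms(text: str) -> list:
    """Split a query on commas and any whitespace (spaces, tabs, newlines)."""
    return [term for term in re.split(r"[,\s]+", text.strip()) if term]


def or_prefix_search(terms: list, catalog: list) -> list:
    """Return catalog genes that begin with ANY of the terms (OR semantics)."""
    wanted = [t.upper() for t in terms]   # case-insensitive matching is assumed
    return [g for g in catalog if any(g.upper().startswith(w) for w in wanted)]


# Small hypothetical catalog of gene symbols.
catalog = ["BRCA1", "BRCA2", "BRAF", "TP53", "ESR1"]

terms = parse_gene_terms("BRCA, tp53\nESR1")
hits = or_prefix_search(terms, catalog)
print(terms)  # ['BRCA', 'tp53', 'ESR1']
print(hits)   # ['BRCA1', 'BRCA2', 'TP53', 'ESR1']
```

A one-column list pasted into the advanced query page would pass through the same splitting step, since newlines count as whitespace.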

13. Querying GeneSigDB: Gene Search Results

Gene Search returns a list of genes. Genes from multiple species within GeneSigDB are displayed, for example human, mouse and rat genes. Use the "Filter Results" panel on the left to refine Gene Search results, or enter your own terms in the Refine Search box. The layout of the Gene Search results is very similar to that of the Publication Search results, except that one selects genes rather than gene signatures.

To select a gene, click on the green + symbol in the Add Gene column. This adds the gene to the Selected Genes basket.

14. Gene Signature Comparison

To visualize a comparison of gene signatures, select multiple gene signatures from the Publication Search results page in the same way as described in section 9. Then on the home page click Compare Selected. In the example below, we searched authors in the Publication Search results section for a comparison between random signatures. These gene signatures are passed to a gene signature comparison view. The same could be done for specific gene comparisons in the gene search.

This opens a gene signature comparison matrix in which the rows are genes and the columns are signatures; the elements of the matrix are colored heatmap-style red/white for present/absent. The default setting is that only genes present in two or more signatures are shown. There is a slider bar to interactively increase that threshold, to show only genes that are present in greater numbers of gene signatures.
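The presence/absence matrix and its threshold can be sketched as below. This is an illustration of the idea rather than the site's implementation; the function name, the signature names and their gene contents are made up.

```python
def comparison_matrix(signatures: dict, min_signatures: int = 2):
    """Build a gene-by-signature presence/absence matrix.

    signatures maps signature name -> set of genes. Only genes present in at
    least min_signatures signatures are kept (the default of 2 mirrors the
    default view; raising it mimics moving the slider).
    """
    counts = {}
    for genes in signatures.values():
        for gene in set(genes):
            counts[gene] = counts.get(gene, 0) + 1

    rows = sorted(g for g, n in counts.items() if n >= min_signatures)
    cols = list(signatures)
    # 1 = present (red in the heatmap), 0 = absent (white).
    matrix = [[1 if gene in signatures[col] else 0 for col in cols] for gene in rows]
    return rows, cols, matrix


# Hypothetical signatures for illustration only.
signatures = {
    "SigA": {"BRCA1", "TP53", "ESR1"},
    "SigB": {"BRCA1", "TP53"},
    "SigC": {"BRCA1", "MYC"},
}

rows, cols, matrix = comparison_matrix(signatures)
for gene, row in zip(rows, matrix):
    print(gene, row)
# BRCA1 [1, 1, 1]
# TP53 [1, 1, 0]
```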

A new feature in GeneSigDB release 4.0 was the ability to export publication-quality images. To export an image, click on "export png".

15. Gene Signature View

When one clicks on a Gene Signature Name one will be brought to the Gene Signature Detail page outlined below. This page contains some details specific to the Gene Signature and the gene listing itself contained within a results table. The Table by default will show the Standardized Gene List, although the view can be changed using the labeled buttons above the table.

Clicking the Original Gene Signature button will give you a table of the gene signature as found in the publication, with the original column headings.

By default, genes that could not be mapped to the standardized gene set are hidden. If these genes are of interest, the "show unmapped genes" button will display them in the table, highlighted in red. These unmapped genes can be viewed within the context of the Standardized Gene Signature by clicking on the button above the table.

One can add columns to the Standardized Gene List by clicking on the Add Original Columns dropdown located above the table.

After manipulating the table, the data can be exported using the buttons on the upper right of the table. The options are Print, Save as CSV, Copy to Clipboard and Save as Excel.

16. Downloading Gene Signatures

In the publication search results, gene search results, or gene or publication entry view, gene signatures can be selected for download using checkboxes, and the selection is passed to a download page. There, a user can choose to download the standard gene list (EnsEMBL gene identifiers and gene symbols) or to convert gene signatures into one or more commonly used identifiers, including EntrezGene or RefSeq gene identifiers, or Affymetrix, Agilent or Illumina probe identifiers. There is no limit to the number of identifiers that can be selected or to the number of gene signatures that can be downloaded concurrently. Each gene signature is provided in a separate file, and multiple gene signatures are compressed into one zip file.

17. Programmatic Access to GeneSigDB

GeneSigDB is capable of providing its functionality through a Java RESTful web service. Currently, there is no existing client implementation to consume these services. If you would like to use the REST services, you will need to implement an HTTP client or use a web browser. We are considering creating a ready-to-use client in a future release of GeneSigDB.

GeneSigDB provides REST services using the reference implementation of JAX-RS found at Glassfish (Jersey: https://jersey.dev.java.net/) to provide the REST HTTP functionality along with the reference implementation of JAXB, also at Glassfish (JAXB: https://jaxb.dev.java.net/), to do the XML transformation.

GeneSigDB provides REST services to retrieve each of the major objects in GeneSigDB (GeneSignature, Gene, and Publication) along with all of their ancillary member objects. These objects are returned in either XML or JSON format. The REST request is made over HTTP by creating a URL that contains a key and then issuing a GET for the specified resource. The Accept header of the HTTP request determines the MIME type (and format) of the response. If the request says that it accepts XML, the response will be XML; if it says that it accepts JSON, the response will be JSON. GeneSigDB also includes XML schema definitions for each of the objects that are potentially generated by the REST service. The links to the XSD files are located here.
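As a rough sketch of how a client might choose the response format, the following builds (but does not send) a GET request with an Accept header, using Python's standard urllib. The helper name is hypothetical, and the exact MIME types the service expects are an assumption; application/xml and application/json are simply the conventional values.

```python
import urllib.request

# Base resource URL as used by the GeneSigDB REST examples in this section.
BASE_URL = "http://compbio.dfci.harvard.edu/genesigdb/rest/resource"

# Assumed MIME types for the two supported formats.
MIME_TYPES = {"xml": "application/xml", "json": "application/json"}


def build_request(resource_type: str, key: str, fmt: str = "xml") -> urllib.request.Request:
    """Build a GET request whose Accept header selects XML or JSON."""
    url = f"{BASE_URL}/{resource_type}/{key}"
    return urllib.request.Request(url, headers={"Accept": MIME_TYPES[fmt]})


req = build_request("publication", "16651414", fmt="json")
print(req.get_method(), req.full_url)
print(req.get_header("Accept"))

# Sending the request would then be, for example:
#   with urllib.request.urlopen(req) as response:
#       body = response.read()
```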

Publication - Publications represent a PubMed publication with a PMID key. Our PMID keys match exactly the PMIDs found in PubMed. We do not have the full set of publications found in PubMed, but rather a subset of them as determined by our specific queries of PubMed to build our initial list of publications to curate and process. We believe we have most of the publications listed in PubMed that have a signature (some sort of list of genes) and that are cancer-related.
Gene - We have a mapped and annotated copy of every gene found in every signature in our dataset. Occasionally, there may be some mapping errors, so we don't have a perfect listing, but it is fairly extensive. Our genes are keyed by Ensembl Id and our Ensembl keys match exactly the Ensembl ids found in Biomart.
GeneSignature - This is the most important object. It bridges the Publication objects and the Gene objects, as a GeneSignature is essentially a list of genes from a given publication. Each publication may have one or more gene signatures, so we have created our own separate key for gene signatures. The first part of the gene signature key is always the PMID of the article from which the signature came. Again, this is a straight mapping to PMIDs at PubMed. The second part of the gene signature key is a little more tricky. It generally consists of a table name or number, a supplement name or number, or some other description that is not necessarily easily identifiable from an outside source.
For example, one of our listed publications has an id of 16651414. This is also the PMID key for the article "Gene expression profiling shows medullary breast cancer is a subgroup of basal breast cancers", by Bertucci et al., located at PubMed. We have three signatures listed for this publication. Their keys are "16651414-Supp2", "16651414-Supp3", and "16651414-Supp4", respectively. Each of these signatures includes a long list of genes, and we have mapped their listings to a set of Ensembl identifiers. The key for each gene in a signature is an Ensembl Id.

GeneSigDB provides RESTful Java services for each of these object types through a URL that includes the object type and its respective key identifier. These objects are represented in XML or JSON format. On the client side, a URL should be built that includes the type of resource being requested and its key, and then submitted to our web server. The web server will then respond appropriately. For example, if the object is found, the server returns a 200 OK status code in the HTTP response header, with a body that contains an XML representation of that object. Note also that if no object is found for a given key, the server responds with 204 No Content. (As a side note, if you are looking at these services using a web browser, 204 No Content is the response the service intentionally returns in a not-found situation, but some web browsers handle a 204 response by not changing the window contents at all. If this is the case, it looks suspiciously like absolutely nothing happened. However, if you inspect the HTTP headers, you will see that the 204 response was there and valid. Web browsers are not the best way to test web services that are designed to be read and used by a client process.)

In the example from above, each of those resources would be retrieved as follows:
Publication - http://compbio.dfci.harvard.edu/genesigdb/rest/resource/publication/16651414
Gene - http://compbio.dfci.harvard.edu/genesigdb/rest/resource/gene/ENSG00000012048
GeneSignature - http://compbio.dfci.harvard.edu/genesigdb/rest/resource/geneSignature/16651414-Supp2

Application Description WADL - http://compbio.dfci.harvard.edu/genesigdb/rest/application.wadl
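Since no ready-made client exists, a minimal one might be sketched as follows, building the resource URLs shown above and treating 200 and 204 as "found" and "not found" respectively. The function names are hypothetical, the response is assumed to be UTF-8 text, and the fetch function is defined but not called here because it needs network access to the server.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "http://compbio.dfci.harvard.edu/genesigdb/rest/resource"


def resource_url(resource_type: str, key: str) -> str:
    """Build the URL for a publication, gene or geneSignature resource."""
    return f"{BASE_URL}/{resource_type}/{quote(key, safe='')}"


def interpret_response(status: int, body: bytes):
    """Map the documented status codes to a result.

    200 -> the body as text (UTF-8 is assumed);
    204 -> None, meaning no object exists for the key.
    """
    if status == 200:
        return body.decode("utf-8")
    if status == 204:
        return None
    raise RuntimeError(f"unexpected HTTP status {status}")


def fetch(resource_type: str, key: str, accept: str = "application/xml"):
    """Fetch one resource over HTTP (not called here; requires network access)."""
    request = urllib.request.Request(
        resource_url(resource_type, key), headers={"Accept": accept}
    )
    with urllib.request.urlopen(request) as response:
        return interpret_response(response.status, response.read())


print(resource_url("geneSignature", "16651414-Supp2"))
print(interpret_response(204, b""))  # None: no object found for the key
```

Checking the status code explicitly avoids the browser behaviour described above, where a 204 response looks as if nothing happened.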