Self-updating map
The resulting RDF map of the TCGA contents is available (rdf.s3db.googlecode.com/hg/TCGA.rdf) and can be traversed efficiently by a SPARQL engine to discover which files document results that satisfy any number of the constraints recognized by the model.
For example, as illustrated in a webcast accompanying that manuscript (GU4), one could identify the files that describe patients whose samples were provided by a specific cancer center and profiled for DNA copy number variation.
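The kind of constraint matching described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the triples, the predicate names (`center`, `dataType`) and the file identifiers are hypothetical and are not the vocabulary used in TCGA.rdf, and a simple in-memory filter stands in for a real SPARQL engine.

```python
# Toy triple store: (subject, predicate, object).
# All identifiers are hypothetical, chosen only for illustration.
TRIPLES = [
    ("file1", "center", "centerA"),
    ("file1", "dataType", "copy_number"),
    ("file2", "center", "centerA"),
    ("file2", "dataType", "expression"),
    ("file3", "center", "centerB"),
    ("file3", "dataType", "copy_number"),
]


def files_matching(triples, constraints):
    """Return the subjects that satisfy every (predicate, object) constraint.

    Roughly analogous to a SPARQL basic graph pattern such as
    SELECT ?file WHERE { ?file :center :centerA ; :dataType :copy_number }
    (prefixes and terms are assumptions, not the published model).
    """
    facts_by_subject = {}
    for subject, predicate, obj in triples:
        facts_by_subject.setdefault(subject, set()).add((predicate, obj))
    wanted = set(constraints)
    # A subject qualifies when the wanted set is a subset of its facts.
    return sorted(s for s, facts in facts_by_subject.items() if wanted <= facts)


if __name__ == "__main__":
    print(files_matching(TRIPLES, [("center", "centerA"),
                                   ("dataType", "copy_number")]))
```

Adding a further constraint only narrows the result set, which is what makes it practical to combine any number of the constraints recognized by the model in a single query.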
Concurrent with the expansion of the TCGA, the tooling required to enable computational ecosystems for data-driven medical genomics (Almeida, 2010) is maturing rapidly, to the point that tools operating within and providing such ecosystems are beginning to appear (Almeida et al., 2012b).
The concern that the web browser is computationally inefficient for advanced numerical procedures has also been largely overcome, as we found when identifying sequence analysis procedures that make use of the MapReduce (Dean and Ghemawat, 2008) distributed computing template (Almeida et al., 2012a; Vinga et al., 2012).
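To make the MapReduce template concrete, the following is a minimal sketch of the pattern applied to a generic sequence task, counting k-mers. It is not the procedure used in the cited work; k-mer counting is swapped in here only because it is small enough to show the map and reduce steps clearly. The function names and the chunking scheme are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce


def split_with_overlap(seq, size, k):
    """Split seq into chunks that overlap by k - 1 characters.

    The overlap ensures that k-mers spanning a chunk boundary are
    counted exactly once, by the chunk in which they start.
    """
    return [seq[i:i + size + k - 1] for i in range(0, len(seq), size)]


def map_kmers(chunk, k):
    """Map step: count the k-mers found in a single chunk."""
    return Counter(chunk[i:i + k] for i in range(len(chunk) - k + 1))


def reduce_counts(left, right):
    """Reduce step: merge two partial k-mer counts."""
    return left + right


def count_kmers(seq, k, size):
    """Run map over every chunk, then fold the partial counts together."""
    partials = [map_kmers(chunk, k) for chunk in split_with_overlap(seq, size, k)]
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    print(count_kmers("ACGTACGT", k=2, size=4))
```

Because each map call depends only on its own chunk, the chunks could be handed to independent workers, which is the property that makes the template attractive for distributed execution.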
Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.
Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at
The Cancer Genome Atlas (TCGA) is a joint project of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) to comprehensively apply genome analysis technology to the study of the biomolecular basis of cancer (NCI Wiki, 2011). Concretely, this project has analyzed tumor and normal samples from over 6000 patients, which resulted in the collection and public availability of 37 types of genomic and clinical data for 33 cancers.
More importantly, in 2011, there was a momentous change in the level of data interoperability of the TCGA data repository: data files are now available directly through HTTP calls to a central directory, located at Jac. This opens entirely new opportunities for interactive reproducible data analysis and visualization.
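As a rough aid to intuition for the growth rate quoted earlier, a doubling time of about 7 months corresponds to the model N(t) = N0 · 2^(t/7), with t in months. The short sketch below evaluates that model; it assumes the doubling time stays constant, which the source states only as an approximation, and the function name is hypothetical.

```python
def growth_factor(months, doubling_time=7.0):
    """Multiplicative growth after `months`, assuming a constant doubling time.

    Implements N(t) / N0 = 2 ** (t / doubling_time).
    """
    return 2.0 ** (months / doubling_time)


if __name__ == "__main__":
    for months in (7, 12, 24):
        print(months, round(growth_factor(months), 2))
```

Under this assumption the number of files more than triples within a year, which illustrates why a manually maintained map of the repository would quickly fall out of date.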