GCB's information systems are designed to minimize the time spent staging data for analysis and to maximize the efficiency of analysis itself. In practical terms, this means that data access is continuous throughout the Center: datasets visible on a networked laptop are also available to thousands of CPU cores on the Duke Shared Cluster Resource (DSCR), a high-performance computational resource shared by researchers across the university. The GCB informatics team also offers specialized computational resources through virtualization, using both university and cloud-based resources.
In terms of raw storage capacity, GCB has one of the largest data storage systems in the Duke University and Health System enterprise. Data is backed up to disk and mirrored to separate locations for disaster recovery. GCB storage and computational resources are housed in enterprise-level data centers with access controls and redundant and emergency power supplies. Storage is designed to ease data sharing among researchers across campus and with collaborators elsewhere. Facilities include separate installations of NetApp FAS 3000 series filers in three locations. Disk shelves attached to the filers use NetApp's high-speed SAS architecture for high-performance storage and SATA disks for more capacious, moderate-performance storage. A Dell R710 server with shelves of MD1200 disk arrays provides less expensive storage for projects where performance is not at a premium, as well as offsite backup storage for restoration in the event of a disaster.
GCB’s hardware provides computational power tailored to the requirements of specific projects across scales. Basic projects can be run on Cisco UCS B200 series blades and Dell M600 blade servers. Additional computational resources are available by arrangement for more computationally intensive projects; access to these dedicated devices is restricted to specific research groups. A cluster with 480 CPU cores is designed for projects that demand large data files, many cores, and fast transfer times, such as de novo genome assembly or mapping reads for multiple RNA-seq libraries. Nodes each have between 128 and 256 GB of RAM, and the system can transfer data between the storage and analysis nodes at speeds of up to 48 Gb/s. Additional high-performance computation can be executed on the Duke Shared Cluster Resource (DSCR), a computational cluster of over 4,000 CPU cores. This cluster is directly connected to GCB's storage infrastructure via dedicated 10 Gb/s fiber Ethernet, allowing easy staging of large datasets. The cluster has all commonly used software, and systems administrators will install additional software on request.
Commonly used software for genome and transcriptome assembly, variant identification, proteomics, gene expression, and other kinds of functional genomic analysis is available to all researchers. The physical infrastructure is particularly well suited for application development by GCB researchers, and a significant number of computational and software-development projects are underway, ranging from software for specialized analysis to enterprise-wide data management and data analysis systems. The GCB informatics team is trained in cloud technologies and can set up customized computational infrastructure for special purposes, including computational servers with Tesla "Fermi" GPU processors or machines fitted with up to 64 GB of RAM.
The Proteomics and Metabolomics Core, Genomic Sequencing Core, and Microarray Core facilities use GCB’s centralized storage, so that researchers can easily acquire large datasets and set them up for analysis. The Microarray and Proteomics Cores use the "Express" data repository for data distribution and analysis for major projects. "Express" has been developed by GCB (formerly IGSP) programmers to ease data production activities in the cores and provide a true repository for data of abiding scientific interest. The system has been used to automate data storage and analysis, and it is being expanded to increase its flexibility and capability.
GCB’s computational infrastructure is Linux-based. Use of open-source software is encouraged, although some projects use proprietary software as needed. The infrastructure is designed to be flexible, with open access for GCB researchers to computational servers outfitted with a broad range of bioinformatics and application-development tools. Software not currently on GCB machines can be installed on request. Special provisions have been made in both the storage and the computational infrastructure for protected sensitive electronic information, such as datasets that fall under HIPAA and HITECH regulation. GCB informatics staff are currently involved in a variety of externally funded projects to expand the tools used to establish data provenance and ensure the reproducibility of highly complex and computationally challenging genomic analyses.