IGSP's information systems are designed to minimize time spent on staging data for analysis and maximize efficiency of analysis. In practical terms this means that data access is continuous throughout the Institute, and datasets visible on a networked laptop are also available to thousands of CPU cores on the Duke Shared Cluster Resource (DSCR), a high-performance computational resource shared by researchers across the university. The IGSP computational team also offers specialized and exotic computational resources through virtualization, using both University and "cloud"-based resources.
IGSP's DNA Microarray Core, Proteomics Core, and Sequencing Core facilities use the centralized storage, so that IGSP scientists can easily acquire large datasets and set them up for analysis. The Microarray and Proteomics Cores use the "Express" data repository for data distribution and analysis for major projects. "Express" has been developed by IGSP programmers to ease data production activities in the cores and provide a true repository for data of abiding scientific interest. The system has been used to automate data storage and analysis, and it is being expanded to increase its flexibility and capability.
The Express Data Repository and the IGSP infrastructure have been recognized as models for supporting efficient and secure data management for large genomic data.
In terms of raw storage capacity, IGSP has the fourth-largest data storage system in the Duke University and Health System enterprise. Data is backed up to disc and mirrored to separate locations for disaster recovery. Storage and computational resources of IGSP are located in enterprise-level data centers with access controls, and primary storage for the Institute's labs and staff is housed in a data center that is staffed 24/7 and has redundant and emergency power supplies.
Currently, individuals are granted 50 gigabytes of backed-up standard storage space, and labs have access to 150 gigabytes of backed-up high-throughput storage space. Labs and projects that require more can purchase additional storage, which is added to the existing storage controllers and systems. The storage is designed to ease data sharing among the Institute's researchers.
IGSP has separate installations of NetApp FAS 3000 series filers in three Duke locations. Disc shelves attached to the filers use NetApp's high-speed SAS architecture for high-performance storage and SATA disc for more capacious, moderate-performance storage. We also have a second type of storage that serves as a cheaper form of backup and stores data offsite for restoration should a disaster occur. This storage consists of a Dell R710 server and shelves of MD1200 disc arrays. This is RAID 6 storage, and the setup features hot-swappable discs in the event of disc failure. Although the device is not backed up to a separate location, the system architecture is very failure resistant and has 24/7 local and vendor support. This storage is also quite inexpensive, costing $300 per terabyte per year in FY 2012/2013.
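As a rough illustration of the pricing quoted above, the following Python sketch estimates what a given volume of this backup storage would cost at the FY 2012/2013 rate of $300 per terabyte per year. The function name and the example volume are hypothetical, and the calculation assumes simple linear pricing with no minimums or discounts, which the text does not confirm.

```python
# Rate quoted in the text for FY 2012/2013 (US dollars per terabyte per year).
RATE_USD_PER_TB_YEAR = 300


def estimate_storage_cost(terabytes, years=1, rate=RATE_USD_PER_TB_YEAR):
    """Return an estimated cost in dollars.

    Assumes strictly linear pricing (an assumption, not stated policy).
    """
    if terabytes < 0 or years < 0:
        raise ValueError("terabytes and years must be non-negative")
    return terabytes * years * rate


if __name__ == "__main__":
    # Hypothetical example: a lab keeping 10 TB for 3 years.
    print(estimate_storage_cost(10, years=3))
```

Actual charges would depend on current rates and any arrangements made with the IGSP IT team.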
Computational power is tailored to fit researchers' requirements, and the infrastructure handles large and small projects. The infrastructure is designed to be flexible, with open access to IGSP researchers on computational servers outfitted with a broad range of bioinformatics and application development tools. Software not currently on IGSP machines can be installed on request.
Two 32 CPU-core Intel machines, each fitted with 128 gigabytes of RAM and 10 Gb/s network connections, are available to researchers with regular and basic demand for computation. Additional computational resources are available by arrangement for more computationally intensive projects, such as high-throughput gene expression microarray or sequence analysis. Access to these dedicated devices is restricted to specific research groups. Special provisions have been made on both the storage and the computational infrastructure for protected sensitive electronic information, such as datasets that fall under HIPAA and HITECH regulation.
The IGSP computation and core infrastructure runs mostly on Cisco UCS B200 series blades, with a few speciality applications handled by older Dell M600 blade servers.
Currently, high-performance computation is executed on the Duke Shared Cluster Resource (DSCR), a computational cluster of over 4,000 CPU-cores. This cluster is directly connected to IGSP's storage infrastructure via dedicated 10 Gb/s fibre Ethernet, allowing for easy staging of large datasets. The cluster has all commonly used software, and systems administrators will install additional software on request. IGSP is a major contributor to the DSCR and has added computational servers funded by the NIH (grant number 1S10RR025590-01) and the North Carolina Biotechnology Center (grant number 2009-IDG-1002).
A new cluster with 480 CPU-cores is currently in development. This cluster is being designed with large sequence files in mind. The nodes that make up the cluster will all have between 128 and 256 GB of RAM, and the entire system will have the potential to transfer data between the storage and analysis nodes at speeds of up to 48 Gb/s, greatly enhancing the ability of IGSP researchers to remain at the forefront of genomic science research.
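To give a sense of what these link speeds mean for staging large sequence files, the sketch below computes an idealized lower bound on transfer time. It assumes the quoted rates are in gigabits per second, uses decimal units, and ignores protocol overhead, disc throughput, and contention, so real transfers will take longer. The function name and the 300 GB dataset size are hypothetical.

```python
def ideal_transfer_seconds(size_gigabytes, link_gigabits_per_second):
    """Return the best-case transfer time in seconds.

    Converts the size to gigabits (8 bits per byte) and divides by the
    link rate. Overhead and storage limits are deliberately ignored.
    """
    if link_gigabits_per_second <= 0:
        raise ValueError("link rate must be positive")
    size_gigabits = size_gigabytes * 8
    return size_gigabits / link_gigabits_per_second


if __name__ == "__main__":
    # Hypothetical 300 GB dataset over a 10 Gb/s link and a 48 Gb/s link.
    for rate in (10, 48):
        seconds = ideal_transfer_seconds(300, rate)
        print(f"{rate} Gb/s: {seconds:.0f} s")
```

Under these assumptions, the same dataset moves in roughly a fifth of the time at the higher rate.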
The IGSP computational infrastructure is Linux-based, since Linux is a widely adopted and very reliable platform for computational biologists. Use of open source software is encouraged, though projects also use proprietary software when it fits their research needs.
The IGSP computational team is also trained in using so-called "cloud" technologies and can set up customized computational infrastructure for special purposes, including computational servers with Tesla "Fermi" GPU processors or machines fitted with up to 64 gigabytes of RAM. With assistance from the staff at the DSCR and the CS department, the IGSP-IT team also has infrastructure still in place from a previous Kimmel Foundation grant, which makes high-performance computing services available to Duke Medicine researchers in a secure "local cloud" that is suitable for analysis of sensitive and protected data.
Commonly used software for sequence analysis, gene expression analysis, and proteomics is available to all researchers. The IT infrastructure is particularly well suited for application development by IGSP researchers, and a significant number of computational and software-development projects are underway, ranging from software for specialized analysis to enterprise-wide data management and data analysis systems.
Currently, IGSP IT staff are involved in externally funded projects to expand the tools used to establish data provenance and ensure the reproducibility of highly complex and computationally challenging genomic analysis.
IGSP's seven-member IT staff includes individuals trained and certified in Oracle and MySQL databases, systems administration, information security, and bioinformatics. A third of the staff hold advanced degrees. Programming and database staff have extensive training and experience in biology labs. Staff have education and work experience at a broad range of organizations, including the European Bioinformatics Institute (EBI), Duke, Northwestern, Western Kentucky University, and the Rochester Institute of Technology.
IGSP and basic sciences faculty members have found that the programming staff in particular have unique talents that can be used in various research projects, and it is common practice for PIs to include members of the IGSP IT team in their grant proposals for special analysis and development projects. This is typically done by including staff effort in the budget. Arrangements of this kind can be made by contacting Mark DeLong well before a grant is submitted.
Systems administration staff serve on a round-the-clock emergency on-call rotation for the Institute's main computational and storage infrastructure. Desktop and routine service is provided during normal business hours.