See this page online at: http://www.biotechfocus.com/ManagingandLeveraging
Whether a scientist is analysing a genome, developing new chemical compounds, reviewing medical scans, designing a new car, or ensuring the safety of our cities, managing the massive data sets involved in leading-edge research and design continues to challenge users’ productivity and prevent breakthrough insights. Not only must scientists contend with an overwhelming volume of data, but they must also complete their work in less time because of today’s competitive economic environment. As the amount of data that can be collected grows, fully understanding and leveraging that data to advance research has become a bottleneck in R&D productivity.
Data Growth
This extreme data growth, to terabyte levels and beyond, is no surprise to industry experts who have long seen that the quest for increasing accuracy and detail would drive data sets to grow even faster than Moore’s Law. This growth rate has been fuelled by requirements for increased data precision and richness.
Greater precision often causes a two- to 10-fold increase in data sizes. In almost every field, data is being produced with more detail, like watching TV in high definition instead of standard definition. For example, MRI scanners now provide details to the millimetre rather than the 10-mm level; X-rays are set to go from 2-D to 3-D; and car crash studies model every surface down to 1 mm rather than as the large, simple blocks used 15 years ago. Scientists now routinely use imaging technologies to understand the impact of putative therapies on animals under treatment or on cells in culture. Confocal microscopy is a technology that allows users to visualize molecular tags targeted to proteins expressed on living cells that may be important in the intervention of disease processes. Confocal images exist as slices, and the storage they require becomes enormous when the slices are combined over a series of targeted molecules for each therapeutic program and then multiplied over the many initiatives of an R&D organization.
Adding more richness to the information, which can increase data sizes two- to 20-fold, is like the difference between watching TV in colour and night vision rather than just in black and white. The database explosion in the sciences can be seen in the increased “precision” of known genes as more information is annotated to the genome databases for hundreds of species.
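To put the two growth ranges quoted above side by side, the short sketch below multiplies them together. Treating precision and richness as independent, multiplicative effects is an assumption made only for illustration, and the 50 GB baseline is a hypothetical figure rather than one taken from any study.

```python
# Illustrative arithmetic only. The ranges are those quoted in the article;
# combining them by multiplication is an assumption, not a measured result.
PRECISION_RANGE = (2, 10)   # "two- to 10-fold" from greater precision
RICHNESS_RANGE = (2, 20)    # "two- to 20-fold" from greater richness


def combined_growth_range(precision=PRECISION_RANGE, richness=RICHNESS_RANGE):
    """Return (low, high) growth factors if both effects apply together."""
    return precision[0] * richness[0], precision[1] * richness[1]


def grown_size_gb(baseline_gb, factor):
    """Scale a baseline data-set size (in GB) by a growth factor."""
    return baseline_gb * factor


if __name__ == "__main__":
    low, high = combined_growth_range()
    baseline_gb = 50  # hypothetical starting size
    print(f"Combined growth: {low}x to {high}x")
    print(f"{baseline_gb} GB becomes {grown_size_gb(baseline_gb, low)} GB "
          f"to {grown_size_gb(baseline_gb, high)} GB")
```

Under these assumptions, a data set in the tens of gigabytes can reach the terabyte range once both effects are applied.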
Managing and Sharing Data
In most drug-discovery organizations where numerous scientists are working on various projects, a variety of computational servers are used to fulfill the requirements for each project. This infrastructure requires transparent sharing of data across each platform. In a typical environment, a storage area network (SAN) acts as an independent network of storage devices that share data with servers on the local area network (LAN). While the performance of data movement among servers can be easily recognized on a SAN, it is important to note that although data for each server resides on centralized and shared disks, it typically resides on partitions of disk space specifically set aside for each server.
There are scenarios where particular data, residing on a partition owned by a server, must be physically moved to be analysed on a different server. This may be due to an analysis that can only be performed on the new server. To achieve this data movement in a typical SAN environment, the data must be passed across the LAN, which, if the data is large, can be very slow and can overwhelm the organization’s LAN environment. New and more elegant solutions, such as SGI® InfiniteStorage File System CXFS™, can allow for immediate and transparent access to the data by the most popular server and workstation types across the SAN. This is achieved by allowing the data to reside on a disk that is not partitioned for each server and by having “clients” for servers so that they can directly access the data. Most importantly, for life sciences research, this data storage and sharing infrastructure allows for transparent, fast and efficient data management across an organization, thus empowering the scientists to do whatever they need whenever needed.
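A rough estimate shows why passing a large data set across the LAN can be slow. The sketch below assumes a hypothetical 1 Gbit/s link and a decimal terabyte (10^12 bytes); the `efficiency` parameter is an illustrative stand-in for protocol overhead and competing traffic, not a measured value.

```python
def transfer_seconds(size_bytes, link_bits_per_second, efficiency=1.0):
    """Estimate the time to move size_bytes over a network link.

    efficiency is the assumed fraction of nominal bandwidth actually usable
    (1.0 means the full link rate is available to this one transfer).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return size_bytes * 8 / (link_bits_per_second * efficiency)


if __name__ == "__main__":
    one_terabyte = 1e12      # decimal terabyte, in bytes
    gigabit_link = 1e9       # hypothetical 1 Gbit/s LAN link
    for eff in (1.0, 0.5):
        hours = transfer_seconds(one_terabyte, gigabit_link, eff) / 3600
        print(f"efficiency {eff:.0%}: about {hours:.1f} hours")
```

Even with the whole link available, a terabyte takes hours to copy under these assumptions, which is the delay that direct access across the SAN is meant to avoid.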
Data Analysis
To handle such large amounts of data, users often face a compromise. Typically, researchers must either settle for long wait times or reduce the data they analyse: they can know more about a smaller region of the problem, or they can look at two important aspects together, but at half the resolution. This kind of trade-off defeats the purpose of having greater precision, richness or completeness in the first place, since using sub-samples of data often results in dangerous gaps in understanding.
Imagine having to understand a book when all that is available is every 10th chapter (a true view of a small section); or every 10th word; or every 10th page (sub-sampling the data at different resolutions). No matter how you play the trade-off, it is hard to rapidly develop a good and complete understanding. Even worse, in most real situations it is not possible to know what was missed. Maybe the chapter with all the critical information was the one that was not read. In the end, these limitations either directly or indirectly limit the size and nature of the problems scientists believe they can solve. Every good scientist and manager knows that the most frequent reason that smart people make poor decisions is that they are unaware of what they don’t know. Using a computational system incapable of analysing the entire data set, whatever its size, limits productivity and creativity and, ultimately, decreases an organization’s competitive advantage.
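The book analogy can be made concrete with a toy example. The sketch below builds a hypothetical signal of 1,000 samples containing one narrow feature, then keeps only every 10th sample; the signal, the feature position and the step size are all invented for illustration.

```python
def build_signal(length=1000, feature_start=503, feature_width=5):
    """Return a flat signal of zeros with one narrow feature of ones."""
    signal = [0.0] * length
    for i in range(feature_start, feature_start + feature_width):
        signal[i] = 1.0
    return signal


def subsample(signal, step=10):
    """Keep every step-th sample, starting from the first."""
    return signal[::step]


if __name__ == "__main__":
    full = build_signal()
    reduced = subsample(full)
    print("samples kept:", len(reduced), "of", len(full))
    print("peak in full data:", max(full))
    print("peak in sub-sampled data:", max(reduced))
```

Because the feature sits between the retained samples, the reduced data set shows no sign of it, and nothing in the reduced data indicates that anything was lost.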
Today, scientists and project managers realize that each of the many types of research fields involved in the drug-discovery workflow has its own set of data analysis requirements. Determining the optimal computational environment for each analytical algorithm now depends upon an examination of its requirements such as:
Choosing an inappropriate solution can lead to sub-optimal performance and, in today’s competitive environment, a non-competitive position.
It’s clear that the difficult questions asked of big data can require capabilities that include and, most likely, extend beyond those of a typical research computing infrastructure. The need for those capabilities may arise from a wide variety of applications, such as instantaneously generating data from clinical trial research, real-time analysis of complex image data, routinely comparing whole genomes, or regularly modelling large complex molecules to meet aggressive timelines. Scientists and managers are now realizing that to achieve a competitive advantage in the drug-discovery market, these types of previously unthinkable problems must now be solved in a production environment.
While it is clear that the typical capabilities offered by a cluster solution will continue to support the more basic computing tasks, additional computational capabilities will be required to solve state-of-the-art problems and analyse the enormous amounts of data collected and stored across the enterprise. One effective way to achieve this goal is to change the paradigm from microprocessor-centric computing to a more capability-defined device where productivity and production are the “gold standard” definitions of performance. This new model is already being forced by the introduction of multi-core microprocessors and the blurring of the historically clear speed-performance paradigm.
Beyond changes in microprocessor performance and the definition of a “core,” large globally addressable memory will become equally important in effectively analysing large complex data. A large, single memory core that is connected to a series of device options can help assure all components communicate efficiently if an extremely high-speed connection between the memory core and the peripheral devices is achieved. The types of peripheral devices can vary widely and, as a result, can provide flexibility to the users. For computation, one of the connected devices could be a variety of solutions such as traditional microprocessor systems, hardware accelerated devices such as a field programmable gate array (FPGA), or even vector processing units. Real-time visualization could be accomplished by just connecting a visualization module to the core memory and providing near-local access to the computational devices.
In addition to the flexibility and performance characteristics of this model, memory-centric computing provides researchers with a tool that lets them address large problems in ways that are beyond the capabilities of other systems. For instance, scientists can put whole databases in the memory core to speed through enormous calculations. Or, if the timeline requires, analysis of that data can be moved from the traditional microprocessors to a hardware-accelerated device such as an FPGA. Or, where a project team needs real-time results from a calculation that typically takes hours to complete, the results from the entire pre-analysed database can be placed in the core memory so that answers to queries can be provided instantaneously. With the flexibility and performance offered by a memory-centric device, the options for extending life sciences research are limited only by the imagination.
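The idea of holding pre-analysed results in memory so that queries return immediately can be sketched in a few lines. Everything in this example is hypothetical: the tiny sequence "database", and the `expensive_score` function, which simply counts G and C bases as a stand-in for a calculation that would really take hours.

```python
def expensive_score(sequence):
    """Stand-in for a long-running analysis: count G and C bases."""
    return sum(1 for base in sequence if base in "GC")


def precompute(database):
    """Run the analysis once per entry and keep every result in memory."""
    return {name: expensive_score(seq) for name, seq in database.items()}


# Hypothetical database; real entries would be far larger and more numerous.
DATABASE = {"geneA": "ATGCGC", "geneB": "ATATAT", "geneC": "GGGCCC"}
RESULTS_IN_MEMORY = precompute(DATABASE)


def query(name):
    """Answer a query by lookup rather than by re-running the analysis."""
    return RESULTS_IN_MEMORY[name]


if __name__ == "__main__":
    for gene in DATABASE:
        print(gene, query(gene))
```

The cost of the analysis is paid once, up front; after that, each query is a single in-memory lookup. The approach depends on the full set of results fitting in memory, which is the case the large shared memory described above is meant to address.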
Putting it All Together
The enormous amount of data generated in modern drug-discovery environments, and the need to manage it effectively, can be overwhelming and considered a hindrance to progress. Clearly, this reaction is due to an inability to effectively manage, share and analyse the data in a flexible way designed to empower the scientist. Moving from a state of “overwhelmed by data” to a state of “empowering scientists to transparently leverage data for productivity” requires a shift from a “data storage” approach to a “data and compute sharing infrastructure” approach. With proven technologies that exist today, the new paradigm is easier to achieve than it was just a few years ago. Further, pre-existing infrastructure can be incorporated into the new environment so that previous investments can be further leveraged.
Unlike typical solutions for computational drug-discovery research, the core of this technology is not a new computer, a microprocessor or a new workstation. Instead, it is a data storage infrastructure that allows transparent sharing of data across the enterprise. Envision logging into your workstation or PC and clicking on folders where important data resides. That data, which may have been generated by an instrument in another part of campus, would exist on a storage device “owned” by a server that the end user has never heard of. All of those technical details would be transparent to the end-user scientist because once he opens the folder, the data is there for him to access, move and analyse, empowering the scientist to perform science.
Dan Stevens is the business development manager for Life, Chemical and Materials Science Research at Silicon Graphics Inc. (Mountain View, CA). His responsibilities include developing high-performance computing, visualization and storage/data management solutions for customers involved in biotechnology, pharmaceutical, chemistry and materials research. Stevens’s formal education includes a doctorate in dental medicine from the University of Pennsylvania (Philadelphia, PA) followed by a clinical specialty in periodontics and a PhD in immunology from the University of North Carolina at Chapel Hill (Chapel Hill, NC). Before joining SGI, Stevens worked at Procter & Gamble (Cincinnati, OH) as a senior scientist in the Health Care Business Unit, where he was responsible for identifying and developing product initiatives and managing clinical trial project programs.