This paper describes ‘PathGrid’—an analysis and data integration system, developed initially to meet the demands in the analysis of medical microscopy imaging data. An overview of the current system is given, describing the techniques used in developing the data handling infrastructure and the analysis algorithm development. The use of software created in the context of systems designed for the astronomy domain is noted, specifically infrastructure from the astronomy virtual observatory movement for data discovery, access and workflow management, and astronomical image analysis software adapted for the analysis of high-throughput astronomy imaging surveys. This paper notes the applicability of the techniques from the astronomy domain. The testbed infrastructure deployment is described, emphasizing its speed and ease of use and support. The validity of the analysis techniques is confirmed through the pilot study described here—with the application to a large sample of immunohistochemistry microscopy data obtained in part for assessing the oestrogen receptor status of breast cancers. The analysis showed that the specificity and sensitivity values for the automatic scoring using PathGrid were within the errors of those obtained via a ‘gold standard’ manual pathologist scoring.
Microscopy of clinical samples has recently begun to generate large image datasets because of the increasing availability of high-throughput automated scanning microscopes and the creation of tissue microarrays (TMAs). TMAs enable the analysis of hundreds of tissue sections on a single slide, resulting in the conservation of tissue and the reduction in inter-experimental variability. Tissue samples collected from patients during surgery and preserved in paraffin (donor blocks) are reviewed by pathologists, who identify tumour tissue and normal control tissue from the samples. Cylindrical cores (usually less than 1 mm in size) are cut from these samples and placed in an array for analysis by immunohistochemistry (IHC) using antibodies to detect a panel of candidate biomarkers. The subsequent manual scoring of TMAs by a trained pathologist is a major bottleneck in their analysis and there is a need for automated approaches to image analysis to provide increased throughput and objective assessment of biomarker expression.
These high-throughput methods, e.g. IHC, underpin research into the discovery and validation of new predictive markers for cancer (e.g. Brenton et al. 2001, 2005; Callagy et al. 2003, 2008; Ahmed et al. 2007; Rexhepaj et al. 2008). New research is increasingly moving towards exploiting a systems approach to pathology (e.g. Cordon-Cardo et al. 2007; Donovan et al. 2008) to discover new predictive markers. This integrative strategy combines morphometric analysis of cancer cells and tissues with other complex molecular datasets and outcome data from clinical trials, together with a range of tools and related information such as genomic information from curated databases. Applying this approach to microscopy promises to accelerate the rate at which new biomarkers can be evolved from discovery and quickly applied in the clinic using standard pathological workflows.
Building on techniques developed in the astrophysics domain, the PathGrid project has developed an integrated data analysis and access system specifically designed to handle effectively the wide range of complexity inherent in microscopy data. The PathGrid solution is fully open, scalable and extensible, and thus is relevant for use in environments where data input and access is required to large, distributed, heterogeneous data.
In this paper, the focus is on the technical basis of the PathGrid system, which allows it to provide the technological workbench for discovery that supports the systems pathology approach. The overall system architecture is discussed in §2. The use of analysis algorithms developed initially for the astronomy domain is described in §3. The current PathGrid testbed system is outlined in §4, while experience gained from the initial use and deployment of the PathGrid system is given in §5. The paper closes with conclusions and outlook in §6.
2. PathGrid architecture
The PathGrid (http://www.pathgrid.org) system is composed of a number of distinct components. In the following sections, the use of a data handling infrastructure originally developed through the astronomy virtual observatory (VO) is described (§2a) along with detail on how it has been adopted specifically for PathGrid (§2b). The workflow management system that supports the development and execution of complex workflows is discussed, together with a brief indication of how the processing chain will in future be integrated with a database management system. A full discussion of the analysis algorithms is left until §3.
(a) The virtual observatory
The VO initiative in astronomy has been developed to meet the specific challenges resulting from the rapid growth of data in astronomy, both observational and model data.
Historically, astronomy has been an observationally based science. Study of the cosmos has enabled a better understanding of the physical processes at work in the Universe, and thus allowed astronomers to answer a range of key questions: from how the Universe formed at the time of the Big Bang, through the formation and evolution of galaxies, to the properties of terrestrial extra-solar planets.
A range of large observatories, both ground based (e.g. the European Southern Observatory telescopes; http://www.eso.org) and in space (e.g. the Hubble Space Telescope; http://www.stsci.edu), are producing observational data across the wavelength domain. Technological advances in areas such as detectors have enabled the sky to be observed across the full range of the electromagnetic spectrum. These new observational facilities generate significant data volumes and this coupled with an increasing need to combine data from differing wavelength regimes (e.g. X-ray and infrared data) leads to significant data and computational challenges.
The VO movement emerged in 2001 with the aim to create a global system to provide uniform access to this distributed data, with project initiatives in the USA (the National Virtual Observatory) and Europe (the European Virtual Observatory) leading the way. In order to coordinate the development of interoperating data services, the International Virtual Observatory Alliance (http://www.ivoa.net) was formed in 2002 (Genova et al. 2002) by representatives from the major VO projects. It has successfully developed a number of interoperability protocols (see http://www.ivoa.net/Documents which gives links to these standards) upon which the VO implementations have been built. This ensures that those VO systems are able to access data and applications provided by data centres publishing their resources conforming to these standards.
In the UK, the AstroGrid project (http://www.astrogrid.org) generated a set of interoperating infrastructure components to enable the publishing of data and applications in a secure environment. AstroGrid was funded over the period 2001–2009, being a consortium (as of 2008) consisting of participating groups from the universities of Bristol, Cambridge, Central Lancashire, Edinburgh, Leicester and Manchester and the Rutherford Appleton Laboratory. From 2009 further development of this infrastructure is being carried out within the context of a number of project initiatives at the European level including the Euro-VO (http://www.euro-vo.org) and the Virtual Atomic and Molecular Data Centre (http://www.vamdc.eu). This ensures future sustainability and continued technical support of the system.
The AstroGrid system is interoperable with data and application services published more generally by a wide range of data centres located globally in the USA, Europe and elsewhere. An astronomer science user of the AstroGrid system is able to make use of the VOExplorer client (Tedds et al. 2008), as a tool to search for and discover relevant data and application resources. Queries and manipulation of these data can then be carried out using inbuilt user interfaces relevant for specific service interfaces, or by interoperating clients (handling for instance data visualization). Figure 1 shows an example use case where astrometric data from the Hipparcos data are discovered through VODesktop, retrieved from the data centre and displayed in a connected desktop visualization tool, all achieved in a short sequence of simple actions through uniform interfaces. In this manner, a comprehensive range of data is available to any astronomer through a single interface.
The AstroGrid software is available from http://www.astrogrid.org and is published as open source software (with an Academic Free Licence). Walton & Gonzalez-Solares (2009) and references therein describe the AstroGrid system and its use for astronomical research (e.g. Walton 2005).
(b) The virtual observatory applied to PathGrid
As noted in §2a, the PathGrid Service-Orientated Architecture (SOA) is based on that developed in the context of the AstroGrid VO and Euro-VO projects.
AstroGrid provides software components to make Web services for resource discovery (registry component), virtual file storage (VOSpace), database access (DSA/catalogue) and application execution (CEA application-server). The latter two components wrap, respectively, a relational database and data-processing modules developed by PathGrid, providing Web access to those functions. A clear separation, with formal interfaces, is maintained between the AstroGrid code and the PathGrid code. This allows independent maintenance and development.
The PathGrid application modules for server-side application are called by the Web-service wrapper through a Unix command-line interface. The application modules themselves do not contain Web-service code and need not be written in a language that supports Web services. Further, the application modules are separate programs and can be written in different languages, which makes it easier to incorporate legacy software.
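The calling convention described above can be illustrated with a minimal Python sketch. It is not the AstroGrid CEA wrapper itself: the function name run_module and the timeout value are hypothetical, and the child process used in the example is only a stand-in for a real analysis module.

```python
import subprocess
import sys


def run_module(argv, timeout_s=600):
    """Run one application module as a separate process and return its stdout.

    The wrapper only needs the module's command line, so the module can be
    written in any language and needs no Web-service code of its own.
    """
    completed = subprocess.run(
        argv,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"module failed with exit code {completed.returncode}: "
            f"{completed.stderr.strip()}"
        )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in for an analysis module: a child process that echoes its argument.
    out = run_module(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", "core_001.fits"]
    )
    print(out.strip())
```

Because the only contract is the command line and the exit status, a legacy program can be wrapped in this way without modification.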
The Web services use both Simple Object Access Protocol (SOAP) and representational state transfer (REST; Fielding 2000) styles. They can be called from desktop applications specific to PathGrid (written in a wide choice of high-level and scripting languages), from generic clients provided by AstroGrid, from other Web applications or from the Taverna workflow system. The AstroGrid code in the Web services handles access control and access to data in VOSpace. The AstroRuntime component is a client-side library supporting access to these services. It encapsulates the details of the Web-service protocols and can itself be called from most languages.
We note the evolution of the VO-based infrastructural components. Earlier implementations carried significant overheads inherent in the Web service-based approach, but these have since been reduced in several ways. First, the Java-based implementation has benefited from significant improvements in the Java Virtual Machine, especially with Java SE 6 (http://java.sun.com/performance/reference/whitepapers/6_performance.html). Second, the SOAP interfaces have been optimized. Third, in some areas the use of the REST style model (Fielding & Taylor 2002) has allowed a significant simplification of the interfaces, with resultant improvements in speed. Finally, the evolution of the workflow enactment engine (Taverna) has seen a significant focus on optimization of performance. The recent releases (e.g. 2.1) reflect this and show that the use of the workflow enactment engine adds only marginal overheads to the execution times of complex workflows (see Taverna 2.1 documentation: http://www.taverna.org.uk/documentation/taverna-2-1/release-notes/). These factors, coupled with our experience from use of the testbed system as described in §5, demonstrate that the PathGrid architecture is suitable for the scale of data inherent in this domain.
(c) Workflow management
In order to allow for the construction and management of a set of processing services into one data analysis pipeline, the PathGrid system contains a workflow component. This is based around the Taverna (Hull et al. 2006; Oinn et al. 2006) workflow management system.
Taverna is a set of tools for designing and running workflows. It consists of a server (or client)-based enactment engine and a desktop client (the Taverna Workbench). It was originally developed for use in the bio-informatics realm but has now been taken up across a wide range of disciplines, from biology, chemistry and medicine to astronomy and the social sciences, among others. Taverna has a datamodel view of workflow. It can invoke various types of services: local Java classes, standard WSDL-described Web services and ‘grid’ services.
The PathGrid implementation makes use of the Astro-Taverna (Walton et al. 2008) plugin for Taverna to construct and enact the complex processing chains. This processor plugin for Taverna provides the interface to the AstroGrid AstroRuntime (Winstanley et al. 2007) thus allowing for the integration of the PathGrid VO-based data and application services.
With a PathGrid workflow (see an example in figure 2) the user can execute a multiple-step pipeline as a single-click operation. This covers the login process, file transfer, image conversion, image analysis, generation of catalogues and storage of resulting images and files on local or virtual file and database systems. It provides computing scalability, an interface to CaBIG (which now also interfaces with Taverna; see http://cabig.nci.nih.gov/tools/taverna) and automatic handling of submission to ‘grid’ and ‘cloud’ clusters.
In a future development to facilitate greater sharing of the research process, the myExperiment (De Roure et al. 2009) virtual research interface will be offered for storage and sharing of packaged processing Astro-Taverna PathGrid workflows. This is a powerful virtual research environment that makes it easy to find, use and share scientific workflows, and thus will provide a useful underpinning ‘sharing’ technology for the growing numbers of users of PathGrid workflows and services.
(d) Database management
A single TMA slide typically contains several hundred tissue samples (cores), each originating from a single donor (i.e. patient) block. It is important to relate the final image analysis results from the PathGrid system to the original slides and tissue cores to enable the results to be integrated with the clinical and pathological data. An algorithm to detect positively stained nuclei in IHC tissue images has been developed (§3). The number of detected nuclei in a 0.6 mm core image is typically of the order of 1000. A database schema was designed to record selected output parameters from the image analysis, including the number, position and intensity values for the nuclear features. The capacity to query the data should enable the pathologists and the statisticians to perform complex analysis; for example, queries to compare different analyses performed with different input parameters and the selection of anomalous results with irregular staining patterns.
At this stage, only preliminary capture of the output catalogues to the database system has been implemented. However, this will be a significant focus of future work, especially with the acquisition of increasingly large datasets and resultant output catalogues.
We note that currently the output binary tables are being ingressed into an Oracle 9g RDBMS. The interface is provided by the AstroGrid Dataset Access (DSA) component. This provides a service interface to the Oracle database system that makes it compliant with, and accessible from, the PathGrid workbench. In particular, queries can be actioned through a workflow. The database system to support PathGrid will be more fully described in a forthcoming paper.
3. Astronomical algorithms
The PathGrid system incorporates a number of image analysis algorithms developed for the analysis of optical and infrared image data.
The Cambridge Astronomical Survey Unit (CASU; http://casu.ast.cam.ac.uk) at the Institute of Astronomy (IoA), University of Cambridge, is the main UK centre of expertise in the analysis of astronomy image data. It is responsible for the processing of significant volumes of imaging data from a range of major observatories. In particular, it has both developed the analysis algorithms and associated pipelines, and operated these pipelines in support of large public surveys from ESO’s 4 m VISTA infrared telescope (e.g. Dye et al. 2006). This telescope, commissioned in 2009, is now being used in survey mode generating typically some 100 GB of data per night, which will eventually lead to image archives of hundreds of terabytes. The analysis systems have been designed specifically to support high-throughput data flows, and are thus robust, and have a high degree of automation.
The astronomy analysis pipelines are designed to extract the maximum amount of information from the astro-imaging sky surveys, enabling the highest possible science return from these surveys.
The processing chains typically involve a range of operations.
— Image processing to remove instrumental effects to generate a linear photon noise-limited image. Deep stacking is often undertaken to enable the rejection of various artefacts such as cosmic rays and bad/dead pixels.
— Detection and parametrization of objects in the images. The extraction algorithms are able to detect objects against complex background variations (at local and global scales). Optimal detection is via matched filters with image segmentation into objects. For each extracted object, parameter estimations are generated, giving information on, for instance, position, flux and morphology.
— External calibration and object classification. This covers both astrometric (positional) and photometric (flux) calibration of the images and extracted objects. These parameters in turn enable classification of each object to be made, for instance generating a star/galaxy probability for any object based on shape/morphology parameters.
— A range of quality control parameters are automatically generated, allowing for an estimation of image quality, background variation, detector performance, etc.
— Matching of images is often performed, across image bands (e.g. measuring the appearance and property of objects as detected through differing colour filters), or detecting variations in the position or flux of an object in time series data.
Figure 3 shows an example of a recent commissioning image of the centre of our Milky Way as observed with the VISTA telescope. This image is composed of three infrared bands and shows the complexity, and richness of structure at the heart of our Galaxy.
The detection of objects follows a two-stage process. A background is fitted and removed, and remaining objects are then identified and parametrized. The technique for detection of the objects follows the formulation described in Irwin (1985), which uses optimal matched filter detection techniques, feature extraction using thresholded pixel connectivity (Lutz 1979) and deblending of objects through multiple hill climbing, analogous to watershedding (Meyer 1991).
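The two-stage structure of the detection step can be illustrated with a short Python sketch. It is a deliberate simplification and not the Irwin (1985) implementation: the background is assumed to be a single constant (the image median), no matched filter or deblending is applied, and the function name detect_objects and its parameters are hypothetical.

```python
from statistics import median


def detect_objects(image, threshold, min_pixels=2):
    """Return a list of objects, each a dict with pixel count, flux and centroid.

    image      -- 2D list of pixel values (rows of equal length)
    threshold  -- level above the background that a pixel must exceed
    min_pixels -- connected groups smaller than this are discarded
    """
    n_rows, n_cols = len(image), len(image[0])
    # Stage 1: estimate a (here constant) background and subtract it.
    background = median(v for row in image for v in row)
    residual = [[v - background for v in row] for row in image]

    # Stage 2: group above-threshold pixels by 4-connectivity.
    seen = [[False] * n_cols for _ in range(n_rows)]
    objects = []
    for r0 in range(n_rows):
        for c0 in range(n_cols):
            if seen[r0][c0] or residual[r0][c0] <= threshold:
                continue
            stack, pixels = [(r0, c0)], []
            seen[r0][c0] = True
            while stack:
                r, c = stack.pop()
                pixels.append((r, c))
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= rr < n_rows and 0 <= cc < n_cols
                            and not seen[rr][cc]
                            and residual[rr][cc] > threshold):
                        seen[rr][cc] = True
                        stack.append((rr, cc))
            if len(pixels) < min_pixels:
                continue
            flux = sum(residual[r][c] for r, c in pixels)
            objects.append({
                "n_pixels": len(pixels),
                "flux": flux,
                # Flux-weighted centroid (row, column).
                "row": sum(r * residual[r][c] for r, c in pixels) / flux,
                "col": sum(c * residual[r][c] for r, c in pixels) / flux,
            })
    return objects
```

Each returned dictionary plays the role of one row in an object catalogue; a production pipeline would add shape parameters and a spatially varying background.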
These then are the main analysis algorithms, which have been transferred and adapted for use as the source extraction routines in PathGrid. These routines are applied to the microscopy images, with the full parametrization of each detected object being output to an object catalogue file specific to each of those images. The pilot study showed that simple changes to the configuration parameters of the object detection algorithm were sufficient to achieve a high degree of detection efficiency. An example is shown in figure 4 where a first run of the detection algorithm fails to locate individual cell nuclei. However, by decreasing the parameter governing the initial estimation for the FWHM size of the objects of interest, an excellent detection efficiency is achieved.
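The sensitivity to the assumed FWHM can be illustrated in one dimension. The sketch below assumes a Gaussian filter profile, which the text does not specify, and uses hypothetical function names; it shows only why a filter that is too wide merges two nearby sources into one peak, while a narrower one keeps them separate.

```python
import math


def gaussian_kernel(fwhm_pixels, half_width=None):
    """Return a normalised 1D Gaussian kernel for the given FWHM in pixels."""
    # For a Gaussian, FWHM = 2 * sqrt(2 * ln 2) * sigma.
    sigma = fwhm_pixels / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    if half_width is None:
        half_width = max(1, math.ceil(3.0 * sigma))
    weights = [math.exp(-0.5 * (x / sigma) ** 2)
               for x in range(-half_width, half_width + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, kernel):
    """Convolve a 1D profile with the kernel, treating out-of-range samples as zero."""
    half = len(kernel) // 2
    out = []
    for i in range(len(profile)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(profile):
                acc += w * profile[j]
        out.append(acc)
    return out


def count_peaks(values):
    """Count strict local maxima in the interior of a 1D sequence."""
    return sum(1 for i in range(1, len(values) - 1)
               if values[i - 1] < values[i] > values[i + 1])


if __name__ == "__main__":
    # Two point-like sources six pixels apart.
    profile = [0.0] * 30
    profile[12] = 1.0
    profile[18] = 1.0
    for fwhm in (12.0, 3.0):
        print(fwhm, count_peaks(smooth(profile, gaussian_kernel(fwhm))))
```

In this toy case the 12 pixel filter leaves a single blended peak, whereas the 3 pixel filter recovers both sources.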
PathGrid provides a suite of applications covering the whole workflow process, from the conversion of the images as received from the image scanner, through object extraction, to statistical routines for final high level analysis.
Effective use of a number of client side tools from astronomy is made in the handling and visualization of the microscopy data. The bulk binary table catalogue files generated as a result of the image analysis process can be converted into comma separated value (CSV) and extensible markup language (XML) files, using the TopCat Stilts libraries (Taylor 2006) if the local user so wishes. TopCat (Taylor 2005; see http://www.star.bris.ac.uk/~mbt/topcat/) is also used as an interactive graphical viewer and editor for the PathGrid tabular data. Aladin (Bonnarel 2000; see http://aladin.u-strasbg.fr/) is used to handle the display of image data. It has the ability to stack images, and also allows for the efficient visualization of catalogue information. Figure 6 shows an example of the use of TopCat and Aladin in combination, with data handling between them and the data as stored in the virtual storage area enabled via use of the Simple Application Messaging Protocol (see http://www.ivoa.net/Documents/latest/SAMP.html; an interoperability standard developed through the IVOA). Note that the bulk data are generated in binary form; however, use of CSV is most appropriate for rapid database ingression, while the XML representation is suitable for visual display owing to the ease of transforming XML data.
4. PathGrid testbed
The current 2009 deployment of the PathGrid data infrastructure testbed links services at the IoA and Cambridge Research Institute (CRI) of Cancer Research-UK (CR-UK).
The hardware system includes two eight core Dell Poweredge servers, with associated 2 TB disk stores. One server is configured to host the PathGrid community, registry and ‘VOSpace’ disk store, while the other acts as the application server.
Figure 5 shows the configuration of the PathGrid infrastructure modules across the servers. The end user runs the user client software from any remote location. Currently, the end users are located at the CRI, CR-UK.
The experience gained from the initial use of the system, now under way at the CRI, is demonstrating its efficacy. Large sets (2500 samples) of ER (a nuclear marker, the analysis of which is described below in §5) and HER2 (a membrane marker, the analysis of which is described in our forthcoming paper) microscopy data have been processed with PathGrid analysis workflows. This initial deployment has demonstrated a range of key features required for future larger scale rollout. These include secure data transport of the image sets from the scanning microscopes at CRI to the development analysis server at the IoA, the deployment of the client user interface tools at CRI, the actioning of the relevant workflow from that client, with the actual analysis run on the servers at the IoA, together with the ingression of the output catalogues into the development Oracle database at the IoA. The reduced images and data products are accessible via the user clients at CRI.
Figure 6 shows the result of the visualization of one of the image cores and the use of desktop client tools interoperating to handle the interplay between the display of image and the catalogue data.
At this stage, little optimization of the workflows has been undertaken. However, preliminary use of the processing chain on the sample datasets has demonstrated that a full analysis of a typical 180 core (equivalent to one slide) dataset (equating to approx. 500 MB of image data) requires of the order of 1200 s using the current testbed application server. We note here that the processing overheads introduced by the workflow management system are of the order of 20 per cent, which is an acceptable value when balanced against the operational efficiency that use of the workflow system brings. This overhead is mainly because of the exchange of XML-formatted control messages between the server and Taverna workflow. These messages are small and thus they only impose this modest additional overhead.
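The figures quoted above imply the following rough per-core and per-megabyte rates. The short Python sketch below only restates that arithmetic; the function name is hypothetical, and because the text does not say whether the 20 per cent overhead is a fraction of the total run time or an addition to the analysis time, both readings are computed.

```python
def throughput_summary(n_cores=180, data_mb=500.0, total_s=1200.0,
                       overhead_fraction=0.20):
    """Derive approximate rates from the quoted order-of-magnitude figures."""
    return {
        # Average wall-clock time spent per tissue core.
        "seconds_per_core": total_s / n_cores,
        # Average volume of image data processed per second.
        "mb_per_second": data_mb / total_s,
        # Reading 1: the overhead is 20 per cent of the total run time.
        "overhead_s_if_fraction_of_total": total_s * overhead_fraction,
        # Reading 2: the overhead adds 20 per cent to the analysis time.
        "overhead_s_if_added_to_analysis":
            total_s - total_s / (1.0 + overhead_fraction),
    }


if __name__ == "__main__":
    for name, value in throughput_summary().items():
        print(f"{name}: {value:.2f}")
```

On either reading the workflow overhead amounts to a few minutes per slide, with the analysis itself taking roughly 7 s per core.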
5. Pilot study validation
The initial validation of the PathGrid system was undertaken by developing and evaluating a scoring algorithm applied to a sample ‘nuclear marker’ dataset. In brief, the presence of the oestrogen receptor (ER) protein in data obtained through the Eastern Cancer Registration & Information Centre (ECRIC) campaign (Wishart et al. 2010) was scored and compared with ‘gold standard’ pathologist scoring (Makretsov et al. 2008). The pilot validation described here was performed using the PathGrid testbed system as described in the previous section.
Pathologists use the Allred classification (Allred et al. 1998; Harvey et al. 1999) to assign a score on a scale of 0 to 8 to each immunostained image core within a slide. This score is composed of two elements: an estimation of the proportion of positively stained tumour cells (0, none; 1, <1%; 2, 1–10%; 3, 10–33%; 4, 33–66%; and 5, >66%) together with an intensity score describing the average intensity of the positive tumour cells (0, none; 1, weak; 2, intermediate; and 3, strong). Added, these give a range of 0–8. Scores of 0–2 represent a negative result, whereas scores in the range 3–8 indicate a positive result.
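The scoring rule above can be written down directly. The Python sketch below is illustrative only: the function names are hypothetical, and because the published bands share their end points (10, 33), the choice to assign a boundary value to the lower band is an assumption made here.

```python
def proportion_score(percent_positive):
    """Map the percentage of positively stained tumour cells to a 0-5 score."""
    if percent_positive <= 0:
        return 0
    if percent_positive < 1:
        return 1
    if percent_positive <= 10:   # assumption: 10 falls in the 1-10% band
        return 2
    if percent_positive <= 33:   # assumption: 33 falls in the 10-33% band
        return 3
    if percent_positive <= 66:
        return 4
    return 5


def allred_score(percent_positive, intensity_score):
    """Sum the proportion score (0-5) and the intensity score (0-3)."""
    if intensity_score not in (0, 1, 2, 3):
        raise ValueError("intensity score must be 0, 1, 2 or 3")
    proportion = proportion_score(percent_positive)
    if proportion == 0:
        # No positive cells, so there is no intensity to add.
        return 0
    return proportion + intensity_score


def is_positive(total_score):
    """Scores of 0-2 are negative; scores of 3-8 are positive."""
    return total_score >= 3
```

For example, a core with 50 per cent of cells stained at intermediate intensity scores 4 + 2 = 6 and is classed as positive.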
The analysis showed high sensitivity and specificity when the automated PathGrid technique was used to generate an equivalent Allred score and compared with the pathologists’ gold standard scores. Moreover, the automated processing and scoring was significantly faster (Walton et al. 2009), measured in minutes for the automated technique compared with hours for the manual scoring. Our forthcoming paper gives a fuller description of the experimental data used in this study.
The validation process involved the development of a multi-step process. First, all image data obtained from the imaging microscopy system were ingressed to the PathGrid system in the form of standard JPG images as exported by the Ariol microscope software system.
These JPG images were then converted to the multi-extension FITS file format used in astronomical data analysis (e.g. Hanisch et al. 2001) with each JPG image being decomposed into its three component channels: red (R), green (G) and blue (B). (We note that FITS is an efficient file format, and is used throughout the processing chain, enabling all applications to rapidly access the individual RGB colour channels. The use of FITS, as the format which the astronomical analysis routines accept, has significant speed gains compared with those algorithms that could handle native JPG files. The overhead in the transformation from JPG to FITS is very low.) These individual channels were then processed to enable a ‘colour’ analysis of each image to be undertaken. For convenience from an astronomy imaging perspective, prior to processing each image channel was inverted (x → 255-x) with the result that a ‘brown’ stain (absence of blue light) now becomes a ‘blue’ glow.
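The channel decomposition and inversion can be sketched in a few lines of Python. This is an illustration under stated assumptions: the image is represented as nested lists of 8-bit (R, G, B) tuples, the function name is hypothetical, and the conversion to FITS itself is not shown.

```python
def split_and_invert(rgb_rows):
    """Split an RGB image into three channels and invert each one (x -> 255 - x).

    rgb_rows -- 2D list of (R, G, B) tuples with values in 0..255
    Returns three 2D lists: inverted red, green and blue channels.
    """
    red = [[255 - px[0] for px in row] for row in rgb_rows]
    green = [[255 - px[1] for px in row] for row in rgb_rows]
    blue = [[255 - px[2] for px in row] for row in rgb_rows]
    return red, green, blue


if __name__ == "__main__":
    # One brown (stained) pixel and one white (empty background) pixel.
    image = [[(150, 90, 40), (255, 255, 255)]]
    r, g, b = split_and_invert(image)
    print(r, g, b)
```

After inversion the brown pixel has its largest value in the blue channel, and the white background becomes zero in all three channels, which matches the astronomical convention of bright sources on a dark sky.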
The underpinning object detection and morphological analysis was done on a ‘black and white’ image created by coadding the R+G+B channels. With an object list it is then straightforward to place apertures over each detected feature and integrate the flux in each channel with respect to the local background per channel. This enables a detailed colour/intensity analysis to be made for each detected feature. The full object shape descriptors are used to preselect features that are most likely to be nuclei based on their circularity and their size on the image. This, for example, allows simple rejection of the majority of fibroblasts in ER data, which have significant ellipticity.
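A minimal Python sketch of aperture integration and shape-based preselection follows. It assumes a circular aperture with the local background taken as the median of a surrounding annulus; the function names and every numerical threshold are hypothetical and are not the values used in PathGrid.

```python
from statistics import median


def aperture_flux(channel, row, col, radius, annulus_width=2.0):
    """Sum one channel inside a circular aperture, relative to the local background.

    channel -- 2D list of pixel values for a single colour channel
    row, col -- aperture centre in pixel coordinates
    The background is the median of the pixels in an annulus just outside
    the aperture.
    """
    inside, ring = [], []
    outer = radius + annulus_width
    for r, line in enumerate(channel):
        for c, value in enumerate(line):
            d2 = (r - row) ** 2 + (c - col) ** 2
            if d2 <= radius ** 2:
                inside.append(value)
            elif d2 <= outer ** 2:
                ring.append(value)
    background = median(ring) if ring else 0.0
    return sum(inside) - background * len(inside)


def is_candidate_nucleus(ellipticity, area_pixels,
                         max_ellipticity=0.4, min_area=20, max_area=400):
    """Keep roughly circular features of plausible size (thresholds are illustrative)."""
    return ellipticity <= max_ellipticity and min_area <= area_pixels <= max_area
```

Calling aperture_flux once per channel at each catalogued position yields the three background-subtracted fluxes used in the colour analysis, while is_candidate_nucleus mimics the rejection of elongated features such as fibroblasts.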
The degree and intensity of nuclear staining for each detected nuclear region then follows trivially from the ratio of blue channel flux (stain) to the average of red and green (reference). Examples summarizing these measures are shown in figure 7, which demonstrates the distribution of the nuclear staining parameters for a heavily stained image and for an unstained image. The dotted lines are internal estimates generated from the loci of points denoting unreliable faint detections (vertical) and the dynamically computed boundary between stained and unstained nuclei.
These distributions of individual nuclear data points are then treated as an ensemble to create an overall score for each image. To duplicate the Allred scoring process as closely as possible, we define two statistics per image: the fraction of nuclei that are stained, given by the number of ‘blue’ points above the horizontal dashed line divided by the total number of points (the ‘proportion’ statistic); and the median intensity ratio of the blue points relative to the median locus of the points below the line (the ‘intensity’ statistic). For each image, these summary statistics are generated.
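The per-nucleus staining ratio and the two per-image statistics can be sketched as follows. This is an illustration only: in PathGrid the boundary between stained and unstained nuclei is computed dynamically, whereas here it is passed in as a parameter, and the function names and the faint-detection cut min_flux are hypothetical.

```python
from statistics import median


def stain_ratio(red, green, blue):
    """Ratio of blue-channel flux (stain) to the mean of red and green (reference)."""
    return blue / (0.5 * (red + green))


def image_statistics(nuclei, boundary, min_flux=0.0):
    """Return the (proportion, intensity) statistics for one image.

    nuclei   -- list of (red, green, blue) aperture fluxes, one per nucleus
    boundary -- stain ratio separating stained from unstained nuclei
    min_flux -- total flux below which a detection is treated as unreliable
    """
    reliable = [n for n in nuclei if sum(n) > min_flux and n[0] + n[1] > 0]
    ratios = [stain_ratio(*n) for n in reliable]
    stained = [x for x in ratios if x > boundary]
    unstained = [x for x in ratios if x <= boundary]
    proportion = len(stained) / len(ratios) if ratios else 0.0
    # The intensity statistic needs both populations to be present.
    if stained and unstained:
        intensity = median(stained) / median(unstained)
    else:
        intensity = None
    return proportion, intensity
```

The two returned values correspond to the ‘proportion’ and ‘intensity’ statistics and could then be mapped onto an Allred-like score.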
In figure 6, the results for one slide sample of 182 cores are also shown. These are displayed using TopCat; note the interplay between windows, selecting a point in the proportion/intensity plane and locating that within the tabular data.
The analysis of Makretsov et al. (2008) compared manual scores with those obtained using the Genetix Ariol processing software, and a semi-automatic technique using the NIH IMAGEJ (Collins 2007) image analysis tool.
A two-dimensional receiver operating characteristic analysis (cf. Florkowski 2008) of the manually scored 273 sample IHC slices is used to define the optimal decision boundary based on a combination of the resulting sensitivity (the proportion of positives which are correctly identified as such) and specificity (the proportion of negatives correctly identified as negatives) figures. Figure 8 shows the automatic PathGrid scoring compared with the gold standard manual scoring determined in the earlier study of Makretsov et al. The results are shown in table 1. It is apparent that the PathGrid algorithm for this nuclear marker gives results comparable to those from both the Ariol software system and the IMAGEJ manual scoring.
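The sensitivity and specificity figures are defined as stated above and can be computed from paired manual and automatic calls. The Python sketch below uses a hypothetical function name and toy inputs; it does not reproduce the values in table 1.

```python
def sensitivity_specificity(manual_positive, automatic_positive):
    """Compare automatic calls with the manual gold standard.

    Both arguments are equal-length sequences of booleans (True = positive).
    Returns (sensitivity, specificity); either is None if undefined.
    """
    tp = fn = tn = fp = 0
    for truth, predicted in zip(manual_positive, automatic_positive):
        if truth and predicted:
            tp += 1
        elif truth and not predicted:
            fn += 1
        elif not truth and not predicted:
            tn += 1
        else:
            fp += 1
    # Sensitivity: proportion of true positives correctly identified.
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    # Specificity: proportion of true negatives correctly identified.
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity
```

Repeating this calculation while varying the decision boundary traces out the receiver operating characteristic from which an optimal boundary can be chosen.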
The full analysis of these results, together with a validation study against a membrane marker, is described in our forthcoming paper.
6. Outlook and conclusions
This paper has described the initial pilot development of the PathGrid system and demonstrated that data analysis and data handling techniques developed for astronomical data are applicable to the analysis and data integration of medical microscopy imaging data.
The initial validation of the PathGrid analysis algorithms applied to the case of the ER marker data demonstrates the accuracy of the results compared against gold standard scoring.
The future direction of the program will involve the development of analysis algorithms and processing pipelines adapted to a wider range of IHC image data. For instance, the algorithm set is now being extended to cover a range of cytoplasmic (e.g. Bcl2) and membrane (e.g. HER2) markers. Initial validation on test sample data for these is currently under way. In particular, our assessment of the algorithm for a ‘membrane marker’ dataset (approx. 2400 samples of HER2 data) shows both significantly improved speed compared with manual pathologist scoring using the ‘Hercep test’ guidelines, and improvements in the specificity and sensitivity of the assessments.
Further, from the infrastructure and data handling aspects, the PathGrid testbed system is currently limited to a small scale deployment at the IoA, Cambridge, and CRI, CR-UK. In order to support an extended set of distributed researchers, involved in wider research collaborations, the client software will be made more widely available.
Work described in this paper was funded through an MRC Discipline Hopper programme award (G0601785) and via a STFC miniPIPSS grant (ST/G003556/1). We acknowledge advice through the Oracle EMEA External Research and Development Programme. Use is made of the software developed by the AstroGrid Virtual Observatory Project, which was funded by the Science and Technology Facilities Council and through the EU’s Framework 6 programme.
One contribution of 16 to a Theme Issue ‘e-Science: past, present and future I’.
© 2010 The Royal Society