The start-up of the Large Hadron Collider (LHC) at CERN, Geneva, presents a huge challenge in processing and analysing the vast amounts of scientific data that will be produced. The architecture of the worldwide grid that will handle 15 PB of particle physics data annually from this machine is based on a hierarchical tiered structure. We describe the development of the UK component (GridPP) of this grid from a prototype system to a full exploitation grid for real data analysis. This includes the physical infrastructure, the deployment of middleware, operational experience and the initial exploitation by the major LHC experiments.
1. The computing challenge of the Large Hadron Collider
The Large Hadron Collider (LHC; Evans & Bryant 2008) will become the world's highest energy particle accelerator when it starts full operation. Protons, with energies of up to 7 TeV, will be collided at 40 MHz to recreate some of the conditions that prevailed in the Universe during the earliest moments of the ‘Big Bang’. Positioned around the 27 km superconducting collider will be four major experiments—ALICE (Aamodt et al. 2008), ATLAS (Aad et al. 2008), CMS (Chatrchyan et al. 2008) and LHCb (Augusto Alves et al. 2008)—which will record the particle interactions of interest. These four experiments contain a total of approximately 150 million electronic sensors and the rate of data flowing from them will be approximately 700 MB s−1, equivalent to 15 PB yr−1. The processing and analysis of these data will require an initial CPU capacity of 100 000 processors operating continuously over many years. This capacity will need to grow with the accumulated luminosity from the LHC and is expected to double by 2010 (Knobloch 2005).
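The quoted rates can be cross-checked with a back-of-envelope calculation. This sketch assumes roughly 2×10⁷ seconds of effective running per year (an assumption, since the accelerator does not run continuously), which is not stated in the text:

```python
# Back-of-envelope check of the quoted data rates. The effective running time
# per year is an assumed figure, not taken from the paper.
rate_mb_per_s = 700          # quoted aggregate rate from the four experiments
running_seconds = 2e7        # assumption: ~2e7 s of effective running per year
total_bytes = rate_mb_per_s * 1e6 * running_seconds
petabytes = total_bytes / 1e15
print(f"{petabytes:.0f} PB/yr")  # ~14 PB, consistent with the quoted 15 PB/yr
```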
Particle physicists have chosen grid technology to meet this huge challenge with the computing and storage load distributed over a series of centres as part of the worldwide LHC computing grid (WLCG).
2. Worldwide LHC computing grid
Once the LHC is in continuous operation, the experimental data will need to be shared among some 5000 scientists at around 500 institutes across the world. Copies of the data will also need to be kept for the lifetime of the LHC and beyond—a minimum of 20 years.
The WLCG currently consists of 250 computing centres based on a hierarchical tiered structure (Shiers 2007). Data flow from the experiments to a single Tier 0 centre at CERN, where a primary back-up of the raw data is kept. After some initial processing, the data are then distributed over an optical private network, LHCOPN (Foster 2005), with 10 Gb s−1 links to 11 major Tier 1 centres around the world. Each Tier 1 centre is responsible for the full reconstruction, filtering and storage of the event data. These are large computer centres with 24×7 support. In each region, a series of Tier 2 centres then provide additional processing power for data analysis and Monte Carlo simulations. Individual scientists will usually access facilities through Tier 3 computing resources consisting of local clusters in university departments or even their own desktops/laptops (figure 1).
The WLCG is based on two major grid infrastructures: the Enabling Grids for E-SciencE project or EGEE (Jones 2005) and the US Open Science Grid (Pordes et al. 2008). The Scandinavian countries also contribute through the NorduGrid infrastructure.
3. GridPP architecture
The UK component of the grid infrastructure for particle physics has been built by the GridPP collaboration through the joint efforts of 19 universities, the Rutherford Appleton Laboratory (RAL) and CERN. The initial concept was first taken through a prototype stage in 2001–2004 (Faulkner et al. 2006) and this was then followed by building a production-scale grid during 2004–2008. With the start of LHC commissioning in 2008, the project has just entered its exploitation phase.
(a) UK Tier 1 centre
The UK Tier 1 centre is located at RAL (figure 2) and the 2008 hardware configuration is based around a CPU cluster of 3200 cores delivering approximately 2500 job slots and 340 disk servers providing 2.3 PB of disk storage. A Sun SL8500 tape robot provides 10 000 media slots, 18 T10K tape drives and a storage capacity of 5 PB.
After receiving a share of the raw data from CERN, the particle interactions (so-called ‘events’) are reconstructed from the electronic information recorded in the various sub-detectors of each experiment. The extracted physics information is then used to select, filter and store events for initial analysis. Datasets are then made available to the Tier 2 centres for specific analysis tasks.
A hierarchical storage management system is used for large-scale storage of data files at the Tier 1. This is based on the CERN Advanced STORage manager (CASTOR) 2 system (Presti et al. 2007). Each of the LHC experiments uses separate CASTOR instances to avoid resource conflicts and to minimize problems with data flow.
(b) UK Tier 2 centres
GridPP has developed four regional Tier 2 centres as shown in figure 2: LondonGrid; NorthGrid; ScotGrid; and SouthGrid. These primarily focus on providing computing power for generating simulated Monte Carlo data and on the analysis of data by individual physicists. Each Tier 2 centre is a federation of computing facilities located in several institutes—for example, NorthGrid consists of the universities of Lancaster, Liverpool, Manchester and Sheffield.
The distributed Tier 2 resources currently provide approximately 10 000 job slots and approaching 2 PB of disk storage. The 17 individual sites vary in size from large centres such as Manchester providing 1800 job slots (2160 KSI2K) and 162 TB of disk, to small departmental clusters of a few machines.
4. Grid middleware
The middleware used in GridPP is the gLite distribution, currently v. 3.1 (gLite 2008), from the EGEE project and adopted by WLCG across all of its sites. This consists of components from a number of different grid projects.
(a) Workload management system
A grid job is specified using a Job Definition Language, based on the Condor ClassAd language, and this is submitted in a script together with the necessary program and input files through the workload management system (WMS; Andreetto et al. 2008). A Resource Broker component accepts each job and matches it to a suitable site for execution. Other components transmit the job to the relevant site and finally manage any output. Sites accept jobs through a gatekeeper machine known as a computing element, which then schedules the job to run on a worker node within the cluster. Small output files are transmitted back through the WMS, while larger data files may be written to a storage element and catalogued.
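A minimal job description in the ClassAd-based JDL might be assembled as below. The script name and sandbox file names are hypothetical; the attribute names (Executable, InputSandbox, OutputSandbox, etc.) are standard gLite JDL fields:

```python
# Sketch of a gLite-style JDL job description, built as a ClassAd-like string.
# The executable and sandbox file names are illustrative, not from the paper.
jdl_fields = {
    "Executable": '"run_analysis.sh"',
    "Arguments": '"dataset42"',
    "StdOutput": '"stdout.log"',
    "StdError": '"stderr.log"',
    "InputSandbox": '{"run_analysis.sh", "cuts.cfg"}',
    "OutputSandbox": '{"stdout.log", "stderr.log", "summary.root"}',
}
jdl = "[\n" + "\n".join(f"  {k} = {v};" for k, v in jdl_fields.items()) + "\n]"
print(jdl)
```

The WMS matches the Requirements and Rank expressions of such a description against the resources published by each site.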
(b) Data management
GridPP supports the small-scale Disk Pool Manager and medium-scale dCache storage solutions at the Tier 2 sites, in addition to the large-scale CASTOR system at the Tier 1 centre. The problem of file access across a heterogeneous grid with multiple mass storage systems has been solved by the storage resource management (SRM) interface. This protocol is designed to provide access to large-scale storage systems on the grid, allowing clients to retrieve and store files, control file lifetimes and reserve space for uploads. The concepts of storage classes and space tokens were introduced recently (Donno et al. 2008) to enable experiments to place data on different combinations of storage devices and to manage storage space dynamically.
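The storage-class idea can be sketched as a mapping from a class name to the combination of tape and disk copies a file should have; the class names follow the TxDy convention of Donno et al. (2008), while the mapping code itself is purely illustrative:

```python
# Illustrative sketch of WLCG storage classes: each class names a combination
# of tape and disk residency for a file.
STORAGE_CLASSES = {
    "T1D0": {"tape": 1, "disk": 0},  # custodial: on tape, disk copy transient
    "T0D1": {"tape": 0, "disk": 1},  # disk-resident replica, no tape copy
    "T1D1": {"tape": 1, "disk": 1},  # custodial with a permanent disk copy
}

def placement(storage_class: str) -> str:
    c = STORAGE_CLASSES[storage_class]
    return f"tape copies: {c['tape']}, disk copies: {c['disk']}"

print(placement("T1D0"))
```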
Data distribution between the Tier 0, Tier 1 and Tier 2 centres is performed using the File Transport Service. This uses unidirectional channels to provide point-to-point queues between sites with file transfers handled as batch jobs, providing prioritization and retry mechanisms in the case of failure.
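The queue-with-retries behaviour of such a channel can be modelled in a few lines. This is a toy model, not the FTS implementation; the function names and retry limit are assumptions:

```python
# Toy model of an FTS-style unidirectional channel: a point-to-point queue of
# file transfers with a bounded retry policy. Purely illustrative.
from collections import deque

def drain_channel(files, transfer, max_retries=3):
    """Attempt each queued transfer, re-queueing failures up to max_retries."""
    queue = deque((f, 0) for f in files)
    done, failed = [], []
    while queue:
        name, attempts = queue.popleft()
        if transfer(name):
            done.append(name)
        elif attempts + 1 < max_retries:
            queue.append((name, attempts + 1))  # retry later
        else:
            failed.append(name)
    return done, failed

# Example: a flaky link where each transfer succeeds on its second attempt.
attempts = {}
def flaky(name):
    attempts[name] = attempts.get(name, 0) + 1
    return attempts[name] >= 2

done, failed = drain_channel(["a.root", "b.root"], flaky)
print(done, failed)  # both files succeed after one retry each
```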
(c) Distributed databases and catalogues
In addition to the interaction data (physics events), experiments also require a large amount of non-event data, describing the current running parameters and calibration constants associated with each sub-detector, to be stored and accessible all over the grid. These time-varying data are essential for the reconstruction of physics events and are stored in a conditions database. Users and production programs also need the ability to locate data files (or their replicas) and this is achieved through the WLCG File Catalogue (LFC). The LFC contains the mappings of logical file names to physical files, along with any associated replicas on the grid.
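The essential LFC mapping is from a logical file name (LFN) to one or more physical replicas. The sketch below uses a plain dictionary; the LFN and storage URLs are hypothetical examples, not real catalogue entries:

```python
# Sketch of the LFC idea: a logical file name maps to one or more physical
# replicas on different sites. All names and URLs below are hypothetical.
catalogue = {
    "/grid/lhcb/2008/dst/run1234.dst": [
        "srm://srm.gridpp.rl.ac.uk/castor/lhcb/run1234.dst",  # Tier 1 copy
        "srm://se.ph.liv.ac.uk/dpm/lhcb/run1234.dst",         # Tier 2 replica
    ],
}

def list_replicas(lfn):
    """Return all physical replicas registered for a logical file name."""
    return catalogue.get(lfn, [])

replicas = list_replicas("/grid/lhcb/2008/dst/run1234.dst")
print(len(replicas), "replica(s) found")
```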
These worldwide distributed databases have been set up by the WLCG 3D (Distributed Deployment of Databases for WLCG) project using Oracle Streams technology to replicate the databases to the external Tier 1 centres outside CERN (Duellmann et al. 2006). At the UK Tier 1 centre, there are independent multi-node database clusters for the conditions databases of both the ATLAS (Viegas et al. 2008) and LHCb (Clemencic 2008) experiments. This ensures high availability and allows for large transaction volumes. The LFC is also based on an Oracle Enterprise relational database, but the ATLAS and LHCb catalogues reside on multi-node clusters shared with other services. These are highly configurable allowing for all services to still run if a node fails.
5. User access and experiment-specific software
Particle physicists access the grid from a user interface—a departmental or desktop computer with the user-level client tools installed. Authentication is based on digital X.509 certificates, which in the case of the UK are issued by the National Grid Service. Individual physicists belong to a Virtual Organization (VO) representing their individual experiment and each computing site in the UK decides which VOs can use its facilities and the appropriate level of resource. A VO Membership Service provides authorization information, specifically the roles and capabilities of a particular member of a VO. At the beginning of each session, a proxy certificate is obtained for a limited lifetime (typically 12 hours).
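The limited proxy lifetime can be illustrated with a simple validity check; the 12-hour figure is from the text, while the dates and function are an illustrative sketch rather than any real middleware API:

```python
# Sketch of proxy-lifetime handling: a session proxy is valid for a limited
# period (typically 12 h) and must be renewed before further job submission.
from datetime import datetime, timedelta

PROXY_LIFETIME = timedelta(hours=12)

def proxy_valid(issued_at: datetime, now: datetime) -> bool:
    return now - issued_at < PROXY_LIFETIME

issued = datetime(2008, 9, 10, 8, 0)
print(proxy_valid(issued, datetime(2008, 9, 10, 19, 0)))  # True, 11 h old
print(proxy_valid(issued, datetime(2008, 9, 10, 21, 0)))  # False, expired
```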
Experiments have also developed front-ends to simplify the definition, submission and management of jobs on the grid. The Ganga interface (Maier 2008), a GridPP-supported project, is the best-known example and is used by the ATLAS and LHCb experiments. It allows jobs to be run either on a local batch system or on the grid and provides all of the facilities for job management, including submission, splitting, merging and output retrieval. Ganga can be used in three ways: interactively; in a script; or through a graphical interface. The adoption of Ganga enables a physicist to exploit the grid with little technical knowledge of the underlying infrastructure. More than 1000 unique users of Ganga were recorded in 2008 as data analysis programs were prepared for the switch-on of the LHC.
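The split/merge pattern that Ganga automates can be shown schematically in plain Python (this is not the Ganga API; the chunk size and event counts are invented for illustration):

```python
# Illustration of the split/merge pattern automated by front-ends such as
# Ganga: a dataset is divided into subjobs and their outputs merged.
def split(files, per_job):
    """Divide a file list into chunks of at most per_job files."""
    return [files[i:i + per_job] for i in range(0, len(files), per_job)]

def run_subjob(chunk):
    return {"events": 100 * len(chunk)}   # stand-in for a real analysis job

def merge(results):
    return sum(r["events"] for r in results)

dataset = [f"file{i}.root" for i in range(10)]
subjobs = split(dataset, per_job=3)       # 10 files -> 4 subjobs
total = merge(run_subjob(c) for c in subjobs)
print(len(subjobs), "subjobs,", total, "events")
```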
Similarly, experiments have written their own data management layers that sit on top of the standard grid services. For example, the GridPP-supported Physics Experiment Data Export system of the CMS collaboration provides a data placement and file transfer system for the experiment (Tuura et al. 2008). This is based around transfer agents, management agents and a control database, providing a robust system to manage global data transfers. The agents communicate asynchronously and between centres the system can achieve disk-to-disk rates in excess of 500 Mb s−1 and sustain tape-to-tape transfers over many weeks.
6. Grid deployment, operation and services
In order to achieve a high quality of service across the wider grid, the WLCG/EGEE infrastructure is divided into 10 regions, each with a Regional Operations Centre (ROC) responsible for monitoring and solving operational problems at sites within its domain. GridPP sites are covered by the UK/Ireland ROC.
Computer hardware in the underlying clusters is usually monitored locally by open-source programs such as Nagios and Ganglia. These can display variations in cluster load, storage consumption and network performance, raising alarms when thresholds are reached or components fail. The basic monitoring tool for the WLCG grid is the Service Availability Monitoring (SAM) system, which regularly submits grid jobs to sites, collects the results of a number of sensors that probe the sites, and publishes them to an Oracle database. At the regional level, problems are tracked by Global Grid User Support tickets issued to the sites affected or to the middleware/operations group.
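At its core, a SAM-style availability figure is simply the fraction of periodic test jobs a site passed over some window. The sketch below is a toy calculation; the site names and probe outcomes are invented:

```python
# Toy SAM-style availability: the fraction of periodic test jobs a site
# passed. Site names and outcomes below are illustrative only.
def availability(results):
    """results: list of booleans, one per submitted test job."""
    return sum(results) / len(results) if results else 0.0

site_results = {
    "RAL-T1": [True] * 9 + [False],
    "UKI-NORTHGRID-MAN": [True] * 7 + [False] * 3,
}
for site, res in site_results.items():
    print(f"{site}: {availability(res):.0%}")
```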
The LHC experiments also need to have a clear picture of their grid activities (e.g. job successes/failures, data transfers, installed software releases) and so have developed their own SAM tests that probe the site infrastructures from their own specific standpoint. Dashboards have been developed to present this information in a concise and coherent way. In addition, GridPP has found it useful to also develop a suite of tests that regularly exercise the essential components in the experiment computing chains. For the ATLAS experiment, jobs are sent every few hours to each UK site on the grid and perform an exemplar data analysis. As can be seen in figure 3, the overall efficiency (ignoring some specific incidents) across all GridPP sites has varied between 60 and 90 per cent and the continuing challenge is to improve the robustness and reliability of the grid for experiments.
7. Experiment exploitation and research applications
Over several years, the LHC experiments have individually exploited the evolving grid infrastructure through a series of ‘data challenges’ and ‘service challenges’. These have been based on large data samples, typically containing many millions of simulated events, which have been generated using the grid and then also processed by the grid using the same data-processing chains prepared for real data.
In readiness for the switch-on of the LHC, all of the experiments simultaneously conducted a Common Computing Readiness Challenge (CCRC08) at the start of 2008, with phases in February and May. The number of batch jobs submitted each month to the UK Tier 1 centre during the build-up to LHC commissioning is shown in figure 4a, while the load of simultaneous running jobs on the Tier 1 during one week of CCRC08 is shown in figure 4b.
In proton–proton mode at nominal luminosity, the four LHC experiments are expected to produce a total data rate of approximately 1600 MB s−1 from the Tier 0 to the 11 external Tier 1 centres. The UK share is estimated to be of the order of 150 MB s−1 and from figure 5 it can be seen that this was achieved during the simultaneous challenges in February and May 2008.
Monte Carlo techniques play a vital role in the modelling, analysis and understanding of particle physics data. As an example, we show the results of a very large number of simulations on the particle physics grid. By simulating a statistical sample of events, equivalent to the size of the expected dataset, many thousands of times, we can study the precision with which a parameter will be measured when real data arrive. Typically, this is undertaken on the grid by farming out many hundreds of simulation runs and then collecting the results to produce a set of statistical distributions or to show a trend. This is illustrated in figure 6, which shows the value of a systematic detector effect between various species of particles after they have been reconstructed in the LHCb experiment, together with a Monte Carlo study of fitting a physics parameter that may be biased by this effect. This parameter is important in studying the asymmetries between the decays of the neutral B meson (particle) and anti-B meson (antiparticle), which ultimately should improve our explanations for the prevalence of matter in the Universe. To produce the plot in figure 6b, each simulated model ‘experiment’ required one million signal events to reach a statistical precision comparable with six months of data taking. Several models were investigated using different input configurations and each model required approximately 500 ‘experiments’ to be run, leading to a very large number of simultaneous jobs on the grid consuming over 100 000 hours of CPU time.
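The toy-experiment procedure described above can be sketched in miniature: generate many pseudo-experiments, fit a parameter in each (here a simple sample mean stands in for a likelihood fit), and inspect the bias and spread of the fitted values. All numbers are illustrative and unrelated to the LHCb study:

```python
# Minimal toy-experiment study: many pseudo-experiments are generated, a
# parameter is "fitted" in each, and the spread of fitted values gives the
# expected statistical precision. Numbers are illustrative only.
import random
import statistics

random.seed(1)
TRUE_VALUE, RESOLUTION, EVENTS, N_TOYS = 0.7, 1.0, 1000, 500

fitted = []
for _ in range(N_TOYS):
    sample = [random.gauss(TRUE_VALUE, RESOLUTION) for _ in range(EVENTS)]
    fitted.append(statistics.fmean(sample))   # stand-in for a likelihood fit

bias = statistics.fmean(fitted) - TRUE_VALUE
spread = statistics.stdev(fitted)             # expected precision ~ 1/sqrt(EVENTS)
print(f"bias {bias:+.4f}, precision {spread:.4f}")
```

With 1000 events per toy, the spread of the fitted values comes out close to 1/√1000 ≈ 0.032, as expected for a Gaussian sample mean.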
Another example of the power of the grid comes from the CMS experiment (CMS Collaboration 2008), which simulated, processed and analysed a sample of over 20 million physics events containing pairs of muons. The aim was to perform a measurement of the μμ mass spectrum and isolate any resonances after processing and filtering the data. A fit to the final subsample of events after track isolation is shown in figure 7, where the Z0→μ+μ− decay can be seen prominently sitting on the background from quantum chromodynamics (QCD).
8. Conclusions
A working grid for particle physics has been established in the UK with the necessary resources for the early exploitation of the LHC. The LHC is poised to become the frontline facility for particle physics over the next decade and GridPP is a vital component of the worldwide infrastructure built to process and analyse the data recorded by the four main experiments.
The commissioning of the collider started with a series of injection tests in August and September 2008. On 10 September 2008 (the penultimate day of the All Hands Meeting 2008 conference), the first circulating proton beam through the entire 27 km of the LHC ring was achieved, closely followed by several hundred orbits. Particle interactions originating from the beam were observed in all four experiments. A technical fault with the accelerator developed on 19 September and proton–proton collisions are now scheduled for autumn 2009; GridPP will then be ready to record the first real LHC collision data following eight years of development.
We acknowledge financial support from the Science and Technology Facilities Council in the UK and from the EGEE collaboration. We wish to thank the many individuals in the UK, both in the Tier 1 centre and the Tier 2 institutes, who have helped to build GridPP. This grid project would also have not been possible without the major contributions from our WLCG and EGEE colleagues at CERN and around the world. The CMS Collaboration and Rob Lambert are thanked for making available the results of their simulations.
One contribution of 16 to a Theme Issue ‘Crossing boundaries: computational science, e-Science and global e-Infrastructure I. Selected papers from the UK e-Science All Hands Meeting 2008’.
© 2009 The Royal Society