Concept of Operations for the LSST Data Facility
D. Petravick and M. Gelman
Document by the LSST DM Technical Control Team. If this document is changed or superseded,
the new document will retain the Handle designation shown above. The control is on the most
recent digital document with this Handle in the LSST digital archive and not printed versions.
Additional information may be found in the corresponding DM RFC.
Draft Revision NOT YET
This document describes the operational concepts for the emerging LSST Data Facility, which will operate the system that will be delivered by the LSST construction project. The services will be incrementally deployed and operated by the construction project as part of verification and validation activities within the construction project.
1.1  2013-09-10  Updates resulting from Process Control and
1.2  2013-10-10  TCT approved
     2016-05-08  Beginning to render working group schema as more complete view of operational need as a
     2017-06-26  Import draft versions into single document, with updates based on evolved operational
Document source location:
2.1 LSSTCam Prompt Processing Services
2.1.1 Scope
2.1.2 Overview
2.1.3 Operational Concepts
2.2 LSSTCam Archiving Service
2.2.1 Scope
2.2.2 Overview
2.2.3 Operational Concepts
2.3 Spectrograph Archiving Service
2.3.1 Scope
2.3.2 Overview
2.3.3 Operational Concepts
2.4 EFD ETL Service
2.4.1 Scope
2.4.2 Overview
2.4.3 Operational Concepts
2.5 OCS-Driven Batch Service
2.5.1 Scope
2.5.2 Overview
2.5.3 Operational Concepts
2.6 Observatory Operations Data Service
2.6.1 Scope of Document
2.6.2 Overview
2.6.3 Operational Concepts
2.7 Observatory Operations QA and Base Computing Task Endpoint
3.1 Batch Production Services
3.1.1 Scope
3.1.2 Overview
3.1.3 Operational Concepts
4.1 User Data Access Services
4.2 Bulk Data Distribution Service
4.3 Hosting of Feeds to Brokers
5 Data, Compute, and IT Security Services
5.1 Data Backbone Services
5.1.1 Scope
5.1.2 Overview
5.1.3 Operational Concepts
5.2 Managed Database Services
5.2.1 Scope
5.2.2 Overview
5.2.3 Operational Concepts
5.3 Batch Computing and Data Staging Environment Services
5.3.1 Scope
5.3.2 Overview
5.3.3 Operational Concepts
5.4 Containerized Application Management Services
5.4.1 Scope
5.4.2 Overview
5.4.3 Operational Concepts
5.5 Network-based IT Security Services
5.5.1 Scope
5.5.2 Overview
5.5.3 Operational Concepts
5.6 Authentication and Authorization Services
7.1 Service Management Processes
7.1.1 Overview
7.2 Service Monitoring
7.2.1 Scope
7.2.2 Overview
7.2.3 Operational Concepts
1 Scope of Document
The LSST Data Facility will operate the data management system as a set of services that will be delivered by the construction project as part of validation and verification activities within the construction project.
The LSST Data Facility provides a set of services that supports specific functions of Observatory Operations:
• An "Offline" L1 Batch Processing Service, not commanded by OCS, to facilitate catch-up processing for use cases involving extensive networking or infrastructure outages, reprocessing of image parameters used by the Scheduler, pre-processing data ahead of release production for broker training, and other emergent use cases as directed by operations.
• An Archiving Service for acquiring raw image data from the LSST main camera and in-
gesting it into the Data Backbone.
• A Spectrograph Archiving Service for acquiring data from the Auxiliary Telescope spectrograph and ingesting it into the Data Backbone.
• An OCS-driven Batch Processing Service for Observatory Operations to submit batch jobs via the OCS environment either to NCSA or to the Commissioning Cluster at the Base Center.
• A QA and Base Computing Task Endpoint that allows fast and reliable access through the QA portal to recently acquired data from the Observatory instruments and other designated data.
• An Observatory Operations Data Service that allows fast and reliable access to data recently acquired from Observatory instruments and other designated datasets.
The concept of operations for each of these services is described in the following sections.
2.1 LSSTCam Prompt Processing Services

This section describes the prompt processing of raw data acquired from the main LSST camera by the DM system. The service acquires data from the main LSST camera as they are taken, and promptly processes them with codes specific to an observing program. The LSSTCam Prompt Processing Services provide timely processing of newly acquired raw data, including QA of images, alert processing and delivery, and returning telemetry to the observing system. The service is operated by the LSST Data Facility as part of the Level 1 system. It is presented to Observatory Operations as an OCS-commandable device. The Prompt Processing Service retrieves crosstalk-corrected pixel data from the main LSST camera at the Base Center, builds FITS images, and sends them to NCSA.
2.1.3.1 Science Operations
Science data-taking occurs on nights when conditions are suitable. For LSST, this means all clear nights, even when the full moon brightens the night sky. Observing is directed by an automated scheduler. The scheduler considers observing conditions, for example, the seeing, the phase of the moon, the atmospheric transparency, and the part of the sky near the zenith. The scheduler is also capable of receiving external alerts, for example, announcements from LIGO of a gravitational wave event. The scheduler also considers required observing cadence and depth of coverage for the LSST observing programs.
About 90% of observing time is reserved for the LSST "wide-fast-deep" program. In this program, fields are observed in two-exposure visits; successive visits may differ, potentially with a new filter, a larger slew, a different observing cadence, or a different visit structure. Another envisioned program is "deep drilling", where many more exposures than the two-exposure visit will be taken.
In practice, science data-taking will proceed when conditions are suitable. Calibration data may be taken when conditions are not suitable for further observations, with science data-taking resuming when conditions allow.
It follows that the desired behavior for science data-taking operations is to start the Prompt Processing system at the beginning of the night and to turn off the system after all processing for the night is complete. Note that there is no definite indication of whether further data will be taken, until the "next visit" is determined.
During science data-taking the Prompt Processing Service computes and promptly returns QA parameters (referred to as "telemetry") to the observing system. The QA parameters are not specific to an observing program; examples are seeing and pointing corrections derived from the WCS. These parameters are not strictly necessary for observing to proceed – LSST can observe autonomously at the telescope site, if need be. Also note that the products are quality parameters, not the "up-or-down" quality judgement.
The scheduler may be sensitive to a subset of these messages and may decide on actions, but
a detailed description is TBD and may vary as the survey evolves. The scheduler can make
use of these parameters even if delivered quite late, since the scheduler uses historical data
in addition to recent data.
The Prompt Processing system also executes code that is specific to an observing program. For science exposures, the code is divided into a front end, which is able to compute the parameters sent back to the observatory, and a back end; Alert Production (AP) is the specific back-end code for the main survey. The detected transients are passed off to another service, which records the data in a catalog that can be queried offline and sends the data to an ensemble of transient brokers.
AP runs in the context of the wide-fast-deep survey, the deep drilling program, and TBD other programs. Other observing programs may also include AP as a science code, or may have codes of their own.
The baseline calibrations include flats and biases. Darks are not anticipated. LSST has an evolving calibration plan.
Nominally, a three-hour period each afternoon is scheduled for Observatory Operations to
take dark and flat calibration data. As noted above, calibration data may be taken during
the night when conditions are not suitable for science observations. As well, the LSST dome
is specified as being light-tight, enabling certain calibration data to be collected whenever
operationally feasible, regardless of the time of day. The calibration program is expected to evolve over the lifetime of the survey as understanding of the LSST telescope and camera improves.
The Prompt Processing Service computes and promptly returns QA parameters (referred to as "telemetry") to the observing system. Note that the quality of calibrations needed for Prompt Processing science operations may be less stringent than calibrations needed for release processing.
An operations strawman, which illuminates the general need for prompt processing, is that:
• Nightly flats, biases and darks consist of about 10 broad-band flatfield exposures taken at the cadence of the expected exposure time. Observers will consider the collection of these nightly calibrations as a single operational sequence that is typically carried out in the afternoon. The Prompt Processing system computes parameters for quality assessment of these calibration data, and returns the QA parameters to the observing system. Examples of defects that could be detected are the presence of light in a bias exposure and a deviation of a flat field from the norm, indicating a possible problem with the flat-field screen or its illumination. The sequence is considered complete when processing (which necessarily lags acquisition of the pixel data) is complete.
• Narrow-band flats and calibrations involving the collimated beam projector (CBP) help determine the response of the system, as a function of frequency, over the optical surfaces. The width of the broad-band filters (760 nm) is large compared to the 1 nm illumination source, and operations using the CBP must be repeated many times as the device is moved to sample all the optical paths in the system. The length of time needed to collect these calibrations leads to the requirement that the system be available over extended periods.
Time for an absolutely dark dome, which is important for these calibrations, is subject to competing operational demands. Because this time is precious, prompt processing is needed to run QA codes to help assure that the data are appropriate for use. Note, the prompt processing system will not be used to construct these master calibration products.
To support these uses, the system must be reasonably available when needed. An approach requiring minimal coordination between the observing and the archive center, which as described below is responsible for maintenance of the system, is a default daily maintenance window, with deviations coordinated as needed.
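The defect checks described above (e.g., light present in a bias exposure) can be sketched as follows; the function name and threshold are hypothetical illustrations, not drawn from the LSST codebase:

```python
import statistics

def check_bias_frame(pixels, max_median=5.0):
    """Flag a bias exposure whose median signal suggests light leakage.

    `pixels` is a flat sequence of pixel values in ADU; `max_median` is a
    hypothetical threshold above which the frame is considered suspect.
    Returns (ok, median) so the caller can forward the QA parameter as
    telemetry while also recording a pass/fail flag.
    """
    med = statistics.median(pixels)
    return med <= max_median, med

# A clean bias hovers near zero; a light leak raises the median.
ok, med = check_bias_frame([0.8, 1.1, 0.9, 1.3, 1.0])
leaky, _ = check_bias_frame([40.2, 39.8, 41.0, 40.5, 39.9])
```

Returning the measured value alongside the flag matches the text's distinction between quality parameters and the "up-or-down" judgement.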
2.1.4 Operational Scenarios
Should the pace of processing fail to keep up with the pace of images, operations input is needed because there are trade-offs, as this would affect the production of timely QA data. However, note that when this situation occurs no immediate human intervention is needed. The service supports policies for prompt operations that can be selected via the OCS interface. These policies are used to prioritize the need for prompt production of QA versus completeness of science processing, decide the conditions when science processing should abort due to timeout, and determine how any backlog is managed. The policies may need to be sensitive to details of the observation sequence.
When needed, all of this processing, including both generating QA parameters and running science codes, can be executed under offline conditions at the Archive Center at a later time. The products of this processing may still be of operational or scientific value even if they are not produced in a timely manner. Considering Alert Production, for example, while alerts may not be transmitted in offline conditions, transients can still be incorporated into the portion of the L1 Database that records transients. QA image parameters used to gauge whether an observation meets quality requirements can still be produced and ingested into the OCS system.
Upgrades to the Prompt Processing payloads are produced in the LSST Data Facility. Change control of this function is coordinated with the Observatory, with the Observatory having an absolute say about insertion and evaluation of new code.
2.2 LSSTCam Archiving Service

This section describes the archiving of data from the main LSST camera to the permanent archive.
The LSSTCam Archiving Service acquires pixel and header data and arranges for the data to arrive in the Observatory Operations Data Server and in the Data Backbone. The role of the service is to read pixel data from the LSST main camera and header data from the OCS system, and to place appropriately formatted data files in the Data Backbone. The service needs to have the capability of archiving at the nominal frame rate for the main observing cadence, and to perform "catch up" archiving at twice that rate.
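As a back-of-envelope illustration of the rate requirement (the per-image data volume here is an assumed placeholder, not an LSST specification; the two-exposures-per-39-second cadence follows the text):

```python
def archiving_rates(exposures_per_visit=2, visit_seconds=39.0,
                    image_bytes=6.4e9, catchup_factor=2.0):
    """Sustained archiving throughput in MB/s.

    The cadence (two exposures per ~39 s visit) follows the text; the
    image size is an assumed placeholder. Returns (nominal, catchup),
    where catchup is twice the nominal rate per the requirement above.
    """
    nominal = exposures_per_visit * image_bytes / visit_seconds / 1e6
    return nominal, nominal * catchup_factor
```

Under these assumptions the nominal rate is a few hundred MB/s sustained, which is the kind of figure the service must meet continuously through the night.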
The service is operated by the LSST Data Facility as part of the Level 1 system. It is presented to Observatory Operations as an OCS-commandable device, and can operate concurrently with other OCS-commanded activities, such as data acquisition, as well as other Level 1 services, such as prompt processing. However, a normal operational mode is operation of the service such that data are ingested promptly.
LSSTCam data is, by default, ingested into the permanent archive and into the Observatory Operations Data Server. However, while all science and calibration data from the main camera require ingest into the Observatory Operations Server, some data (e.g., one-off calibrations, engineering data, smoke test data, etc.) may not require archiving in the Data Backbone.
2.2.4 Operational Scenarios
2.2.4.1 Delayed Archiving
In delayed archiving, Observatory Operations may need to defer ingest while continuing to work with the Observatory Operations Data Service. The archiving service provides a number of modes to support this scenario.
2.3 Spectrograph Archiving Service
The Auxiliary Telescope is a separate telescope at the Summit site, located on a ridge adjacent to that of the main telescope building. This telescope supports a spectrophotometer with coverage corresponding to the filter passbands on the main LSST camera. The purpose of the spectrophotometer is to measure the absorption, which is how light from astronomical objects is attenuated as it passes through the atmosphere. By pointing this instrument at known "standard stars" that emit light with a known spectral distribution, it is possible to estimate the extinction. This information is used to derive photometric calibrations for the data taken by the main LSST camera.
The Auxiliary Telescope camera produces 2-dimensional CCD images; the headers and metadata, including the pointing on the sky, are recorded. The Auxiliary Telescope slews 1:1 with the main LSSTCam, which implies two exposures every 39 seconds.
From the point of view of LSST Data Facility Services for Observatory Operations, the spectrograph is an instrument operated independently from the main LSSTCam. Thus, the operations of and changes to LSST Data Facility services for the spectrograph are managed independently as well. The Camera Data System (CDS) for the single CCD in the spectrograph uses a readout system based on the LSSTCam electronics and will present an interface for the Archiver to build FITS files. Telescope data are obtained from the Observatory Control System.
The Spectrograph Archiving Service reads pixel data from the Spectrograph version of the CDS and metadata available in the overall Observatory Control System
and builds FITS files. The service archives the data in a way that the data are promptly avail-
able to Observatory Operations via the Observatory Operations Data Service, and that the
data appear in the Data Backbone.
Archiving these data is broadly similar to archiving data from LSSTCam. Keeping in mind the differences between the two systems, the concepts of operations for LSSTCam archiving apply (see section on LSSTCam Archiving Service). One differing aspect is that these data are best organized temporally, while some data from LSSTCam are organized differently.
There is no prompt processing of Spectrograph data in a way that is analogous to the prompt processing of LSSTCam data.
2.3.4 Operational Scenarios
2.3.4.1 Change Control
Upgrades to the Spectrograph Archiving Service are produced in the LSST Data Facility.
2.4 EFD ETL Service
The Engineering and Facility Database (EFD) is a system used in the context of Observatory Operations. It contains all data, apart from pixel data acquired by the Level 1 archiving systems, of interest to LSST originating from any instrument or any operation related to observing. The EFD is an essential record of the activities of Observatory Operations. It contains data for which there is no substitute, as it records raw data from supporting instruments and instrumental telemetry. The EFD ETL Service is responsible for extracting designated EFD data, loading it into the Data Backbone and transforming this data into a format suitable for offline use.
The Original Format EFD, maintained by Observatory Operations, is conceived of as a collection of files and approximately 20 autonomous relational database instances. The databases in the Original Format EFD have a temporal organization. This organization supports the need within Observatory Operations to support high data ingest and access rates. The data in the Original Format EFD is immutable, and will not change once entered.
The EFD also includes a large file annex that holds flat files that are part of the operational record. There is a need to relate the time series data to raw images and computed entities produced by L1, L2, and L3 processing, and to hold these quantities in a manner that is accessible using the standard DM stack.
1. There is a need to access a substantial subset of the Original Format EFD data in the general context of Level 2 data production and in the Data Access Centers. This access includes the ability to relate EFD data to raw images and computed entities produced by L1, L2, and L3 processing.
2. To be usable in an offline context, files from the Original Format EFD need to be ingested into the LSST Data Backbone. This ingest operation requires provenance and metadata handling.
3. Because the Original Format EFD is the fundamental record related to instrumentation, the actions of observers, and related data, the data contained within it cannot be recomputed, and in general there is no substitute for this data. Best practice for disaster recovery therefore applies.
2.4.3.1 Operational Context
Ingest of data from the Original Format EFD into the Reformatted EFD must be controlled by Observatory Operations, based on the principle that Observatory Operations controls access to the Original Format EFD resources. The prime framework for controlling operations is the OCS system. Operations in this context will be OCS-commanded.
The query load applied by general staff on the Original Format EFD must not disrupt Original Format EFD operations in the period where LSST Operations occurs. Observatory Operations will copy database state and related operational information into a disaster recovery store at a frequency consistent with a Disaster Recovery Plan approved by the LSST ISO. The LSST Data Facility will provide the disaster recovery storage resource. The DR design procedure should consider whether normal operations may begin prior to a complete restore of the data.
If future operations of the LSST telescope beyond the lifetime of the survey do not provide for
operation and access to the Original Format EFD, the LSST Data Facility will assume custody
of the Original Format EFD and arrange appropriate service for these data (and likely move
the center of operations to NCSA) in the period of data production following the cessation of
LSST operations at the Summit and Base Centers.
LSST staff working on behalf of any operations department will have access to the Original Format EFD. When commanded via the designated OCS device, the LSST Data Facility will ingest the designated contents of the file annex of the Original Format EFD into the data backbone. The LSST Data Facility will arrange that these files participate in whatever is developed for disaster recovery for the files in the Data Backbone. These files will also participate in the general file metadata and file-management service associated with the Data Backbone, and thus be available using I/O methods of the LSST stack.
2.4.3.3 Reformatted EFD Operations
• The Reformatted EFD is replicated to the US DAC and the Chilean DAC.
• LDF will extract, transform and load into the Reformatted EFD pointers to files that have been transferred from the EFD large file annex into the Data Backbone.
• LDF will extract, transform and load designated tabular data from the Original Format EFD into the Reformatted EFDs residing in the Data Backbone at NCSA and the Base Center.
• “Designated” data will include:
which may choose to host a copy.
2.4.4 Operational Scenarios
Observatory Operations will periodically test a restore in a disaster recovery scenario. Should the Reformatted EFD relational database be reproducible from the Original Format EFD, disaster recovery can rely on restoring operations of the Reformatted EFD relational database and ETL capabilities from the Original Format EFD. Ingested files from the file annex can be recovered by the general disaster recovery provisions for files in the Data Backbone.
2.5.4 Operational Scenarios
2.6 Observatory Operations Data Service

This section describes the services provided to Observatory Operations to access data that satisfies the requirements that are unique to observing operations. These requirements include fast and reliable access to recently acquired data.
The Observatory Operations Data Service provides fast access to recently acquired data from Observatory instruments and designated datasets stored in the Data Backbone, for use by staff and tools working in the context of Observatory Operations. The quality of service (QoS) needed for these data is distinct from the general availability of data via the Data Backbone.
Access to data provided by the Observatory Operations Data Service is distinguished from normal access to the Data Backbone in the role the data play in providing straightforward feedback to immediate needs that support nightly and daytime operations of the Observatory. Newly acquired data is also a necessary input for some of these operations. The service is provided by the LSST Data Facility to Observatory Operations, and is used by observers and automated systems. It meets the quality of service needed to support Observatory Operations for a subset of the data that is critical for short-term operations, providing access to that subset at a QoS that is different (and higher) than the general Data Backbone. Less critical data is provided to Observatory Operations by the Data Backbone, which provides service levels provided generally to staff anywhere in the LSST project.
For general access to the data for assessment and investigation at the Base Center, the ser-
vice level is the same for any scientist working generally in the survey.
The Observatory Operations Data Service is instantiated at the Base Center. Therefore, the Observatory Operations Data Service does not directly support activities which must occur at the Summit.
The service operates in-line with the Spectrograph and LSSTCam Archiving Services. Newly acquired raw data are first made available in the Observatory Operations server, and then archived into the Data Backbone.
The intent is to provide access to:
• Recently acquired raw data, for a period specified by policy. An example policy is "last week's raw data".
• Other data as specifically identified by Observatory Operations. This may be file-based data or data resident in database engines within the Data Backbone.
File system export: The Observatory Operations Data Service provides a file-system export of its data holdings.
Butler interfaces: Use of the LSST Stack is advocated for Observatory Operations, and so access to this data is possible via access methods supported by the LSST stack. The standard access method provided by the LSST stack is through a set of abstractions provided by a Butler, which presents a data-access context, and updates that context continuously as new data (for example, new raw images) are acquired.
Native interfaces: Not all needed applications in the Observatory Operations context will use the LSST stack, so native file-system access is also supported.
Http(s) interface: The Observatory Operations Data Service also exposes its file system via http(s). Use of the Observatory Authentication and Authorization system is required for this access.
• Concern: The need includes a continuously updated window of newly created data, in
contrast to the other Butler use cases. How well the current set of abstractions work in
a system that is ingesting new raw data is unknown to the author.
• Concern: Approaches to satisfying these desiderata include solutions ranging from an ETL into flat files, to establishing mirrored subsets of files. It is likely that if relational data are needed, caches of relational data will need to be made by extract, transform and load into a file format such as SQLite.
• Concern: This service needs to be available in TBD operational enclaves (and limited to those enclaves).
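A minimal sketch of the ETL-into-SQLite approach mentioned above, with a hypothetical table schema standing in for a designated EFD subset:

```python
import sqlite3

def build_sqlite_cache(rows, cache_path):
    """Load a relational subset into a standalone SQLite file.

    `rows` stands in for the result of a query against a source database
    (e.g. a designated EFD table); the (ts, channel, value) schema is
    hypothetical. Returns the number of rows cached.
    """
    con = sqlite3.connect(cache_path)
    con.execute("CREATE TABLE IF NOT EXISTS efd_cache "
                "(ts REAL, channel TEXT, value REAL)")
    con.executemany("INSERT INTO efd_cache VALUES (?, ?, ?)", rows)
    con.commit()
    n = con.execute("SELECT COUNT(*) FROM efd_cache").fetchone()[0]
    con.close()
    return n
```

A single-file cache like this can be shipped to consumers that cannot reach the source database, which is the point of the ETL-into-flat-files option.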
2.6.4 Operational Scenarios
The LSST Data Facility provides specific "offline" (i.e., not coupled to Observatory Operations) batch production services:
• Annual Release Processing: Processing of payloads of tested work flows at NCSA and satellite sites.
• Calibration Processing: processing of payloads of tested work flows at NCSA and satellite sites through and including ingest of release products into file stores and relational databases. Calibration production occurs at various cadences from potentially daily to annual, depending on the calibration.
• Batch framework upgrade testing: Test suites run after system upgrades and other changes.
• Payload testing: Verification and validation of work flows from the continuous build system.
Batch production service operations consist of executing large or small processing campaigns. Batch production services execute designated processing campaigns to support release production, "after-burner" processing to modify or add to a data release, at-scale integration testing, producing datasets for data investigations, and other processing as needed. Campaign concepts are as follows:
ing, producing datasets for data investigations, and other processing as needed. Campaign
• A campaign is a set of pipelines, a set of inputs to run the pipelines against, and a method of handling the outputs of the pipelines.
• A campaign satisfies a need for data products. Campaigns produce the designated batch data products specified in the DPDD [LSE-163], and other authorized data products.
• Campaigns can be large, such as an annual release processing, or small, such as processing for a working group.
Each pipeline is an ordered sequence of individual steps. The output of one or more steps may be the input of a subsequent step downstream in the pipeline. Pipelines may produce final end data products in release processing, may produce calibrations or other data products used internally within LSST operations, or may produce data products for investigations.
A campaign is the set of all pipeline executions needed to achieve a LSST objective.
• Each campaign has one or more pipelines.
• For each pipeline, there is a way to identify the input data needed for each invocation.
• Each campaign has an adjustable campaign priority reflecting LSST priority for that objective.
• Each pipeline invocation may require one or more input pipeline data sets.
• Each pipeline invocation produces one or more output pipeline data sets. Notice that, for LSST, a given file may be in multiple data sets.
• For each input pipeline data set there is a data handling scheme for handling that data set in a way that inputs are properly staged.
• For each output pipeline data set there is a data handling scheme for handling that data set in a way that outputs are properly archived.
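The campaign and pipeline concepts above can be modeled as plain data structures; a minimal sketch in Python, with illustrative names only:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Pipeline:
    """An ordered sequence of steps; step names are illustrative."""
    name: str
    steps: List[str]

@dataclass
class Campaign:
    """A set of pipelines, their inputs, and a priority, per the text.

    `priority` is adjustable to reflect LSST priority for the objective.
    """
    objective: str
    pipelines: List[Pipeline]
    input_datasets: List[str]
    priority: int = 0

# A hypothetical large campaign and its single illustrative pipeline.
drp = Campaign(
    objective="annual data release",
    pipelines=[Pipeline("calexp", ["isr", "characterize", "calibrate"])],
    input_datasets=["raw", "calibrations"],
    priority=10,
)
```

Keeping the priority as a plain, mutable field matches the bullet stating that campaign priority is adjustable during operations.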
The key factor in the nature of the LSST computing problem is the inherent trivial parallelism due to the nature of the computations. This means that large campaigns can be divided into smaller independent units of work. Some use cases are better served by jobs arranged to be run outside the batch production service environment. For example, alternate environments may better suit development and debugging. From these considerations, the LSST Data Facility separates the concerns of a reliable production service from these other use cases, which do not share the concerns of production.
Computing resources are needed to carry out a campaign. Batch processing occurs on LSST-dedicated computing platforms at NCSA and CC-IN2P3, and potentially on other platforms. Other resources, such as storage to hold final data products and network connectivity, are also needed to completely execute a pipeline and completely realize the data handling scheme for input and output data sets.
Computing resources are physical items which are not always fit for use. They have scheduled and unscheduled downtimes. Support for campaigns, provided by the Master Batch Job Scheduling Service, requires:
1. the detection of unscheduled downtimes of resources
3. best use of available resources.
One class of potential resources are opportunistic resources which may be very capacious but do not guarantee that jobs run to completion. These resources may be needed in contingency situations.
The types of computing platforms that may be available, with notes, are as follows.
• NCSA batch production: Ethernet cluster with competent cluster file system
• NCSA L1 computing: machines + competent caches and large volume storage
• Backfill context on HPC computers
An Orchestration system is a system that supports the execution of a pipeline instance. The Orchestration System:
• Provides any needed data format conversions (e.g., MySQL to SQLite, flat table, etc.)
• In-job context:
Provides any "pilot job" functionality.
Provides stage-out for pipeline output data sets when stage-out requires job context.
Arranges for any post-job stage out from cluster file systems.
The Master Batch Job Scheduling Service:
• Considers the ensemble of available compute resources and the ensemble of campaigns.
• Dispatches pipeline invocations to an Orchestration System based on resource availability and campaign priority.
• Considers pipeline failures reported by the Orchestration System.
Identifies errors indicative of a problem with computing resources, and arranges for mitigation.
Identifies some computational errors, and arranges for incident report.
• Provides appropriate logging and events (N.b. critical events can be programmed to generate incident reports).
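A minimal sketch of priority-ordered dispatch to available resources, as the bullets describe; the data shapes and names are hypothetical, not an LSST interface:

```python
def dispatch(campaigns, resources):
    """Assign the highest-priority campaigns to available resources.

    `campaigns` is a list of (name, priority) tuples; `resources` is a
    list of (name, up) tuples. A real scheduling service would also weigh
    data locality and failure history; this sketch only shows
    priority-ordered dispatch that skips resources in unscheduled downtime.
    """
    queue = sorted(campaigns, key=lambda c: -c[1])  # highest priority first
    assignments = []
    for rname, up in resources:
        if not up or not queue:
            continue  # skip down resources; stop assigning when queue empties
        cname, _ = queue.pop(0)
        assignments.append((cname, rname))
    return assignments
```

Dispatching only to resources reported as up reflects the requirement to detect unscheduled downtimes before making best use of what remains.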
• Quality Assurance (QA) is what people do. This is identifying the issue and arranging for fixes. One source of input is quality controls, described below. Another source of input is monitoring.
• A Quality Control (QC) is a software artifact that produces some sort of data that concerns production quality, for a likely retrospective purpose. QC outputs may be:
Fed as input into active quality control, which is software that automatically affects production.
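A passive quality control of the kind described earlier (comparing a flat field to the norm) might look like this; the function name and tolerance are hypothetical operating values:

```python
def flat_field_qc(flat_median, reference_median, tolerance=0.05):
    """Compare a flat field's median level to a reference norm.

    Returns a record that a person (QA) or an active control can act on:
    the fractional deviation plus a pass/fail flag. The 5% tolerance is
    an assumed placeholder, not an LSST specification.
    """
    deviation = abs(flat_median - reference_median) / reference_median
    return {"deviation": deviation, "ok": deviation <= tolerance}
```

Emitting both the metric and the flag lets the same artifact serve retrospective review and, if fed into active quality control, automatic responses.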
The LSST Data Facility will conduct a number of concurrent campaigns that support LSST goals. These campaigns will be coordinated with the LSST batch infrastructure and with certain Level 1 services that require offline processing.
The system is programmed to deal with anticipated errors. Human eyes are applied during working hours, and can be summoned when events in the underlying systems generate incidents. Each campaign is monitored for technical progress, both in the sense of analyzing and responding to overtly flagged errors, and general monitoring and human assessment of the campaign.
1. Quality controls are considered by an LSST Data Facility Production Scientist and other staff. Data Facility staff apply any standard authorized mitigations, such as reprocessing, flagging anomalies, etc. The Production Scientist within the LSST Data Facility understands the full suite of quality controls, alerts the Science Operations group to anomalies, and applies scientific acumen to assess the data products at a first level, in addition to monitoring. Areas of attention include:
(b) a processing campaign that is resource intensive, hence expensive to redo (or has other operational impact);
(c) known problematic output data sets that are not adequately covered by existing quality controls;
(d) known problematic input data sets not adequately covered by existing quality controls.
This first-order assessment is distinct from the deeper scientific quality assurance in the project. Information obtained from first order quality assurance informs improvements to the quality controls.
3.1.4 Operational Scenarios
specifying an initial set of pipelines, a coverage set, and an initial priority. The Batch Pro-
duction Service is consulted with a reasonable lead time. Consistent with LSST processes,
pipelines can be modified or added (for example, in the case of after-burners) during a cam-
paign. These changes and additions are admitted when the criteria of change control pro-
• the impact of esource-intensive campaigns is approved and understood
scheduled so that it occurs after the review is completed. This includes backing out files, materials from databases, and other production artifacts from the Data Backbone, and maintaining production records as these activities occur.
Pause campaign
Stop a long-running campaign from proceeding, allowing for
and be maintained. There will be campaigns, described in the operations documents. It
focused scrutiny on aspects of production for some pipeline. In many cases problems will be
needs to support mustering focused effort on quality analysis that is urgent, and lacks an
adequate basis for robust quality controls. The LSST Data Facility Batch Production Services
staff contribute effort to solve these problems, in collaboration with Science Operations
ately as the associated pipelines terminate or after a period of time when inspection pro-
cesses run. Such data need to be marked such that they will not be included in release data
and will be set aside for further analysis.
Deal with sudden lack (or surplus) in resources
few jobs, once the workload manager is aware of the new level of resources. The technical system responds to hardware failures on a running job like any other system, with the ultimate recovery being to delete any partial data and retry, while respecting the priorities
The LSST Data Facility provides authorized users and sites access to data via a set of services
that are integrated with the overall Authentication and Authorization (AA) System. These
services are hosted by LSST Data Facility at the US and Chilean Data Access Centers and will
The LSST Data Facility hosts the alert distribution system and supports users of the LSST
5 Data, Compute, and IT Security Services
The LSST Data Facility provides a set of general IT services which support the LSST use-case-specific services mentioned in previous sections. These “undifferentiated heavy lifting” services include:
• Data Backbone Services providing file ingestion, management, and movement between storage sites.
• Managed Database Services providing database administration for all database technologies on each LSST-provided platform at NCSA and the Base Center.
• Network-based IT Security Services providing project-wide intrusion detection and vulnerability scanning.
The concept of operations for each of these services is described in the following sections.
The Data Backbone is a set of data storage, management, and movement resources distributed between the Base Center and NCSA. The scope of the Data Backbone includes both
survey and used by L1, L2, and Data Access Center services.
The Data Backbone provides read-only data service to the US and Chilean DACs, but does not host data stores where DAC users create state. This is done to create a hard and easily enforceable separation of technologies, where no flaw in a DAC can corrupt the data produced by L1 and L2 production systems. For example, DAC resources such as Qserv and user
The Data Backbone ensures that designated data sets are replicated at both sites. The data
• Replication of designated file data within LSST Data Facility sites at NCSA and the Base
• Replication of designated relational tables and data maintained in other database en-
gines at NCSA and the Base Center.
• Implementation of policy-based flows to the disaster recovery stores. At the time of this
• Ingest of data from TBD other sources, approved by a change control process.
• Serving data to the US and Chilean Data Access Centers.
• Integrity checking and other curation activities related to data integrity and continuity
through the lifetime of the LSST project, which at the time of this writing is seen as serving
• Obtaining a logical filename through querying metadata and provenance.
• Possibly migrating a file from a medium where the file is not directly accessible, such as
tape, to a medium where the file is accessible.
• Selecting a distinct instance of the file from possibly many replicas.
• Accessing the file through an access method such as a file system mount or HTTP(S).
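The steps above can be sketched as a small resolution routine. This is a minimal illustration only: the catalog layouts, paths, and function names are assumptions, not the Data Backbone's actual API.

```python
from pathlib import Path

# Hypothetical in-memory stand-ins for the Data Backbone's metadata and
# replica catalogs; all names and layouts here are illustrative.
METADATA = {("visit", 1234): "raw/visit-1234.fits"}
REPLICAS = {"raw/visit-1234.fits": [
    {"site": "ncsa", "medium": "tape", "path": "tape://vol7/visit-1234.fits"},
    {"site": "ncsa", "medium": "disk", "path": "/dbb/raw/visit-1234.fits"},
]}

def resolve(query_key):
    """Obtain a logical filename by querying metadata and provenance."""
    return METADATA[query_key]

def stage_from_tape(replica):
    """Migrate a file from tape to directly accessible storage (stubbed)."""
    staged_path = "/dbb/staging/" + Path(replica["path"]).name
    return {**replica, "medium": "disk", "path": staged_path}

def select_replica(logical_name):
    """Select a distinct instance of the file, preferring accessible media."""
    replicas = REPLICAS[logical_name]
    on_disk = [r for r in replicas if r["medium"] == "disk"]
    return on_disk[0] if on_disk else stage_from_tape(replicas[0])

logical = resolve(("visit", 1234))
replica = select_replica(logical)
# Final step: access replica["path"] via a file system mount or HTTP(S).
```

The key design point the text implies is that callers work with logical names and never with raw paths, so replicas can move between media without disrupting users.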
The project has identified several caches of data that are used in production circumstances.
The distinguishing circumstances for these caches involve quality-of-service requirements for performance and availability. Absent sophisticated QoS in file systems, performance requirements are met by controlling access to the underlying storage via caching. Availability
location information. Application-level cache management provides path names within the
Casual use of data for short periods may rely on knowledge such as file paths, but is subject to
disruption when paths are re-arranged, should the underlying storage technology change, or the database technology involved. Databases identified as holding permanent records of the
survey are in the Data Backbone in the sense that they are instantiated in the context of
a security enclave with management, operational, and technical controls needed to assure
preservation of this data, and that the principal concern of enclave management is that data
reside at the Base and at NCSA driven by business need.
Operational Scenarios
Availability and Change Management
Catalog-based access systems such as
of the file store and its access methods.
• Data originate outside of the LSST Data Facility, but are (or are potentially) used in L1
an L1 or L2 computation.
• Data originate as data produced within the L1 or L2 production processes, and are
meant to be retained for some period of time.
or other aspects of the LSST Data Facility.
In LSST, a distinction is made between patterns of storage of data in a database engine
(schema, for purposes here) and an implementation of the schema in a database engine
which stores the data. In LSST, common schema are used and shared in many scenarios in
distinct but schema-compatible databases.
As an example, a common relational database schema can be used in development, unit
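A minimal sketch of this pattern: one schema definition instantiated in two distinct but schema-compatible databases. SQLite and the table name are illustrative choices, not the LSST baseline.

```python
import sqlite3

# A single schema definition (DDL) applied to two distinct database
# instances, e.g. a unit-test database and a production-like one.
# Table and column names are illustrative, not the actual LSST schema.
SCHEMA = """
CREATE TABLE exposure (
    exposure_id INTEGER PRIMARY KEY,
    obs_start   TEXT NOT NULL
);
"""

def make_database():
    """Instantiate the common schema in a fresh database engine instance."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

dev_db = make_database()    # developer / unit-test instance
prod_db = make_database()   # production-like instance, same schema

# Schema-compatible: code written against one works against the other.
dev_db.execute("INSERT INTO exposure VALUES (1, '2017-06-26T00:00:00')")
```

The two connections hold independent data, but any query valid against one schema instance is valid against the other, which is the compatibility property the text describes.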
The primary focus of Managed Database Services is, as outlined, not the support of developers, but the support of production and data needing custody or curation.
means is all schema designed by Managed Database Services. Managed Database Services
does have a role in determining the fitness for use of any schema present in databases it
Operational Context
The operational context for Managed Database Services is
the context of the LSST Data Backbone within the LSST Data Facility.
• Present a managed database service hosting the equired schema
• Support the evolution of schema in a managed database service
Batch Computing and Data Staging Services provide primitives used by the Master Batch Job Scheduling Service. Batch Computing and Data Staging Services are provided in a distinct
Batch Computing and Data Staging Services are provided at NCSA and the Base Center.
Analogous (but not identical) services are provided by MoU to the LSST Data Facility by CC-IN2P3,
running in the context of each enclave.
Data Access Center from this pool.
and US Data Access Centers enclaves from this pool.
Kubernetes provisioning. An enhanced goal is the unification of resource management of the
Data staging refers to mechanisms needed to move data, primarily files, between the Data
between mounted file systems, or as complex as staging via HTTP or FTP.
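The two staging mechanisms named above can be sketched as follows; the function names and endpoints are illustrative, not an LSST interface.

```python
import shutil
import urllib.request

def stage_local(src, dst):
    """Simple case: a copy between mounted file systems."""
    shutil.copyfile(src, dst)

def stage_http(url, dst):
    """More complex case: staging via HTTP(S), streaming to local storage."""
    with urllib.request.urlopen(url) as resp, open(dst, "wb") as out:
        shutil.copyfileobj(resp, out)
```

Either primitive moves a file into or out of an enclave; which one applies depends on whether the source and destination share mounted storage.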
• Provide any enclave-specific resources. An example is distinct head nodes for different
• Provide enclave-specific configurations, including configurations needed for information
• Integrate ITC into the batch system.
needed for batch operations in an enclave will change.
Operating conditions may change as well. For example, even with container abstractions, it
Lastly, NCSA has substantial resources for prompt processing, such as alert production. Scheduling jitter and performance may preclude using a single batch scheduler for general offline
5.4 Containerized Application Management Services
each enclave, but are provisioned on a common pool of ITC resources residing in the Master
Center and one at NCSA. These instances are the basis for servicing elastic computing needs
at each site and a portable abstraction for symmetric deployment on commercial provision-
ture that logically partitions a pool of Kubernetes resources to various enclaves at the respective
in the context of each enclave.
providing a containerized application management system compatible with LSST requirements.
• Provide containerized application management to each enclave, respecting enclave-
clave at each site and expose them to a given enclave, implementing policies appropriate to
needed for elastic services in an enclave will change.
Operating conditions may change as well. For example, even with container abstractions,
5.5 Network-based IT Security Services
The LSST Network-based IT Security Service provides technical con-
trols for operational security assurance. These controls provide data that support the LSST
Master Information Security Plan and IT security processes such as incident detection and
• Network Security Monitoring, including monitoring of high-rate data connections for
data transfer across the LDF system boundaries (but excluding certain high rate trans-
fers, such as the Level 1 service access to the Camera Data System), including deploy-
• The technical framework to facilitate efficient Incident Detection and Response, including
• Management of certain access controls, such as firewalls and bastion hosts, used for
that there is an LSST Information Security Officer (ISO) who reports to the Head of the LSST construction project, and will transition to report to the Head of the Operations project.
The ISO drafts a Master Security Program [LPM-121] plan, which the Head approves, accepting accountability for the residual risk of the plan. This is the security risk that remains, given faithful execution
of the plan. The ISO oversees implementation and evolution of the plan, seeing that it is faithfully implemented and noting when mitigation and changes are needed. The ISO does any required staff work for the Head; for example, running staff training. The ISO is informed of and keeps records on security incidents, and is responsible for evolution of the security
The ISO is responsible for an Information Security Response Team, which deals with actual or
latent potential breaches in information security. The Incident Response Team is made of a
set of draftees from the various operations departments, with the draft weighted towards
The ISO runs the annual security plan assessment. The management of each construction
• A comprehensive list of IT assets, applications, and services.
• A list of security controls the department applies to each asset (technical and opera-
• A list of controls supplied by others that are relied on.
These controls apply to all offered services and all supported ITC. Reporting is easiest if the
systems offered are under good configuration control. Under a good system, the security
plans are living and updated by an effective change control process.
Verification: The ISO oversees a group that provides network-based security services de-
scribed in the Objective part of this concept of operations.
A general approach to LSST-specific networking is the use of software-defined networking.
This provides for isolation of networking supporting security enclaves. In particular, this al-
The context for these security enclaves cover the following production services in the LSST
project, though other enclaves may join if feasible and desired by the relevant operational
US Data Access Center Ser-
Critical Observing Enclave
These are the standard elements of an information security infrastructure
which are needed for a credible IT security project. Certain elements of the system are near
area will be seen as a flaw in the overall construction plan, preparing the LSST MREFC for
Normal Operations
The following elements provide the functionality needed to
• Intrusion Detection Systems (IDS) detect patterns of network activity that indicate at-
tacks on systems, compromise of systems, violations of Acceptable Use of systems,
for firewall audits and ARP scanning for network asset management can also be con-
for storage (making logs resistant to modification by an attacker) and processes the logs
• Firewalls and bastion hosts provide a layer of active security. A typical use of a bastion
host is to provide a layer of security between networks used to administer computers
• Host-based IntrusionDetectioncomplementsnetworkmonitoringbydetectingactions
within a system not visible from the network with tools such as auditd and OSSEC. This
componentalsomonitorsthe filesystemsandchecksfor filesystemintegrity.
• Active Response blocks communication with entities outside the observation site net-
works. This component is typically used to block “bad actor” entities outside the obser-
New systems being deployed must be “hardened” to a security baseline and vetted by security
Operational Scenarios
Vulnerability scanning periodically assesses designated ports
in a crucial case is assessing the effectiveness of the program of work patching a critical vul-
Intrusion detection can detect, for example, an attempt to compromise a system. The detec-
tion system interacts with the active response system to cut off the attacker’s access to the
computer. The intrusion detection system can also be used to aid in the investigation of an
Host-based Intrusion Detection checks for attacks against a host from the perspective of the
host. Examples include multiple failed remote logins as reported by the host, or reports of file
system changes that do not accompany an approved request for change or do not fall within
The networks at the Observatory site must be monitored by intrusion detection systems.
Acceptable IDS solutions include Bro and Snort. These systems must be able to handle the
traffic load from various network segments at 10 Gb/s to 100 Gb/s. The IDS systems must
be placed at strategic locations and account for any expansion or changes in the network
without the need to completely retool the IDS systems.
The information produced by the system is accessible by LSST staff involved in LSST informa-
tion security, “landlords” hosting systems, and other parties with a valid interest in the data,
border routers on a network and offers the shortest route to destinations being blocked.
Quagga and ExaBGP are two examples of BHR software.
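As an illustration of the BHR idea, the snippet below formats the kind of announce/withdraw commands a BGP speaker such as ExaBGP can consume to null-route a “bad actor” address. The discard next-hop address is an assumption chosen for the example; 65535:666 is the well-known BLACKHOLE community defined in RFC 7999.

```python
# Hypothetical helper that builds black-hole routing commands for a BGP
# speaker. The next-hop is a locally chosen discard address (assumption);
# the community 65535:666 is the RFC 7999 BLACKHOLE community.
DISCARD_NEXT_HOP = "192.0.2.1"

def blackhole_announce(address):
    """Command to announce a host route that black-holes `address`."""
    return (f"announce route {address}/32 "
            f"next-hop {DISCARD_NEXT_HOP} community [65535:666]")

def blackhole_withdraw(address):
    """Command to withdraw the black-hole route for `address`."""
    return f"withdraw route {address}/32 next-hop {DISCARD_NEXT_HOP}"
```

Because the /32 host route is the most specific route available, border routers prefer it, and traffic to the blocked destination is discarded at the network edge.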
The central Configuration Management System will enforce a security baseline and config-
uration on all systems. Examples of this technology are Puppet and Chef. In the event that
for patching will have to be available. It is required that this system would also be the Domain
The central log collectors are responsible for collecting and archiving all logs collected as de-
scribed in the previous section. The collectors must be able to store at minimum six months
of logs with a rotating window deleting the oldest logs to maintain disk space. In addition to the log collectors, there is a SIEM/analysis system. This system is used for real-time log
alerts, searches, and visualization. ElasticSearch, Kibana, and OSSEC are three examples of
such software. This server spools a copy of the logs from the central collectors but may not
be able to keep the full time window due to overhead or log metadata storage.
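The rotating retention window on the collectors can be sketched as a simple pruning routine. The directory layout (one file per archived log) and the function name are assumptions for illustration.

```python
import os
import time

RETENTION_SECONDS = 183 * 24 * 3600  # roughly six months, per the text

def prune_old_logs(log_dir, now=None):
    """Delete archived log files older than the retention window, oldest
    first, implementing the rotating window described above. Assumes the
    directory contains only regular files, one per archived log."""
    now = time.time() if now is None else now
    removed = []
    for entry in sorted(os.scandir(log_dir), key=lambda e: e.stat().st_mtime):
        if now - entry.stat().st_mtime > RETENTION_SECONDS:
            os.remove(entry.path)
            removed.append(entry.name)
    return removed
```

In production this policy would typically be enforced by the log platform itself; the sketch only makes the window arithmetic concrete.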
All systems, both workstations and servers, are required to send system logs to a central
collector. For Linux systems, syslog must be configured to send a copy of all logs in realtime
to the central collector. For Windows systems, software such as Snare will be installed to
exist for log collection, such as Logstash, an open-source log collection tool that can be used to
collect logs from a wide variety of platforms.
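For a Linux host, the realtime forwarding described above can be expressed as a one-line rsyslog rule; the collector hostname is illustrative.

```
# /etc/rsyslog.d/forward.conf -- send a realtime copy of all logs
# to the central collector ("@@" selects TCP; a single "@" is UDP)
*.* @@central-collector.example.org:514
```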
Network devices are also required to send system logs to the central collector. Note that this is different from any network logs a switch or router would send. The system logs refer to
events such as device logins, configuration changes, and other system specific events. Note
Network devices such as routers or firewalls that are placed on the ingress/egress points of a VLAN or the network must send firewall or router ACL logs to the central collector.
It is best practice for network devices to also send netflow to a central collector. However, if
netflow is collected and forwarded to the central collector it must not be at the expense of
Any other devices not classified as a server, workstation, or networking device must be con-
figured to send logs to the central collector if this feature is available. An example of a device
that falls into this category is a VOIP appliance or a VPN appliance.
5.6 Authentication and Authorization Services
ITC is managed in distinct enclaves. Enclaves are defined based on administrative and security controls, and operational availability requirements. Enclaves may span geographic sites, with elements in both the Base Facility in La Serena and at NCSA. Enclaves may share computing and other resources. Central administration is operated by NCSA staff, including remote administration of the Base Facility, with “pair-of-hands” support staff in Chile.
ing Enclaves provide administrative, security, and core computing infrastructure that
• Level 1 Enclave, which spans Chile and NCSA to support prompt processing and archiv-
• US Data Access Center Enclave, which presents to each authorized user the ability to
query the Qserv database, make custom state in MyDB databases and user areas on file
to query the Qserv database, make custom state in MyDB databases and user areas on
file systems, access a custom JupyterHub, access files via shell, and submit batch jobs.
• Data Backbone Enclave, spanning Chile to NCSA, which hosts the primary record of the
1 services, data produced by project processes, files from the large file annex of the EFD,
and relational databases that are not part of the DACs.
• Wide Area Network, which provides connectivity between border routers of La Serena,
The LSST Data Facility provides a set of services supporting overall management of services,
Service management processes are drawn from the ITIL IT Service Management vocabulary.
from the Information Technology Infrastructure Library (ITIL), which is an industry-standard
1. Service Design: Building a service catalog and arranging for changes to the service of-
and controlling the order and timing of inserting changes into the system.
ment of the success of these changes.
interacts with project producing a specific change to ensure
that a complete change is presented to change management for approval into the
live system. Examples areas that are typically a concern are accompanying docu-
provides an accurate model of the components in the
3. Service Delivery: operating the current set and configuration of production services.
and status of all operational services within its scope. The monitoring system deals with
quality controls related to service delivery. These data have both retrospective and real-time
• Acquires data from subordinate monitoring systems within components that are not
bespoke LSST software. These monitoring systems may have an API, log files, SNMP,
(pex.log), event package (ctrl_events), L1 logging, L1 events, scoreboards (Redis), TBD
• Synthesizes new quality control data from existing quality control data (for example,
correlating a series of events before generating an event that will issue a page).
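The correlation rule in the parenthetical above can be sketched as a small stateful filter: page only when several related events arrive within a time window, rather than on every individual event. The threshold, window, and class name are assumptions for illustration.

```python
from collections import deque

class Correlator:
    """Synthesize a paging event from a series of lower-level events.
    Thresholds and the event representation are illustrative only."""

    def __init__(self, threshold=3, window_s=300):
        self.threshold = threshold   # events required to warrant a page
        self.window_s = window_s     # correlation window, seconds
        self.seen = deque()          # timestamps of recent related events

    def ingest(self, event_time):
        """Record one event; return True when a page should be issued."""
        self.seen.append(event_time)
        # Drop events that have aged out of the correlation window.
        while self.seen and event_time - self.seen[0] > self.window_s:
            self.seen.popleft()
        return len(self.seen) >= self.threshold

c = Correlator()
pages = [c.ingest(t) for t in (0, 60, 120)]  # third event trips the page
```

The same structure generalizes to richer rules, such as requiring events from multiple subordinate monitoring systems before escalating to an incident.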
• Can generate events based on performance or malfunction which can trigger incident
• Can generate reports used for problem management, availability management, capacity
• Is sensitive to dynamic deployment of services to ITC resources.
ing alerts, painting displays, and recording data for retrospective use (concerns segre-
and at CC-IN2P3, as well as the networks between these sites.
A dataset based on the operational characteristics of the facilities, hardware, software and
other elements of service infrastructure is needed to support service management, service
delivery, service transition, and ITC-level activities, as well as to provide health and status in-
formation to the users of the systems. This dataset must be substantially unified, so that all
activities are supported by a single source of truth. From a unified dataset, for example, staff
concerned with availability management of a service can obtain records that consistently re-
with capacity management can obtain information on how capacity is provided by ITC activi-
In general, service management needs both a subset of the data that is needed for ITC man-
not supplied by ITC monitoring include the end-to-end availability of a service that tolerates
hardware faults, user-facing comfort displays which address specific areas of interest, and
Center, satellite computing centers, test stands at LSST Headquarters, wide area networks,
specific (non-uniform) ITC monitoring and service management information on which LDF
services rely. In all cases the LSST Data Facility needs to centrally acquire sufficient data
to provide for management of LDF services, while minimizing coupling to the ITC or service
provisioning from these sites. The coupling should be defined in an internal Service Level
in the table below. The monitoring system may acquire additional essential data by agents,
Subordinate monitoring interfaces pro-
subsystem to Chilean border router
Network transport on UIUC campus to L1 in-
U of I networking, NCSA networking
Observatory Operations, Computing Facility
ITC for L1 system, exclusive of reliances listed
LDF ITC group or relied upon NCSA groups
NCSA/NPCF facility resource management    NCSA/NPCF facility management
Service-specific code and service-level per-
formance as a part of the overall system,
Events indicating service
formation about marginal
or near-miss events de-
Staff, and LSST HQ
real-time status of services
NCSA staff should be able to see
the same information as Observa-
tory Operations staff to prevent con-
portant to note that monitoring re-
lied on by Observatory Operations
in Chile having reliances on NCSA
need to operate and provide appro-
Information about when
alerts are being exported
• Availability management: queries, reports, and displays focused on historical contributions to failures by
• Capacity management: queries, reports, and displays focused on historical
• Contract and SLA management: queries, reports, and displays of quantities related to performance, e.g., re-
Batch Production Services
WAN will be managed by different entities (ISPs) based on who owns the particular section
of the network. Typically, all ISPs run their own SNMP monitoring to track the health of the
devices. LDF service monitoring taps into this information base and collects and forwards it
levels of users as listed below:
The level of access and response capabilities will be as defined in the user-profile. In the case
of a “Generic User,” it may be necessary only to show if the LSST system is up and running.
A “Super User” will have access to detailed status information on the systems and subsystems, and will be able to see in-depth event history and status reports (through log-scraping
and fault databases). The Super User will also be able to access the logs database through
In-between levels of access will be defined as per the definition of the roles and responsibili-
ties of the user.
Back to top
Dubois-Felsmann, G., 2016, Auxiliary Instrumentation Interface between Data Management and Telescope
Jurić, M., et al., 2017,
Petravick, D.L., Withers, A., 2016, LSST Master Information Security Policy
Withers, A., 2017, Concept of Operations for Unified LSST Authentication and Authorization