
Large Synoptic Survey Telescope (LSST)

S17B HSC PDR1 Reprocessing Report

Hsin-Fang Chiang, Greg Daues, Samantha Thrush, and the NCSA team

DMTR-31

Latest Revision: 2017-08-28


Abstract

This document captures information about the large scale HSC reprocessing we performed in Cycle S17B.

Change Record

Version  Date        Description                            Owner name
1.0      2017-08-21  Initial report.                        Hsin-Fang Chiang
1.1      2017-08-28  Initial report with minor revisions.   Hsin-Fang Chiang


Contents

1 Dataset Information
2 Hardware
3 Software
  3.1 Software Stack Version
  3.2 Pipeline Steps and Configs
  3.3 Units of Independent Execution
  3.4 Example Commands for Processing
  3.5 Butler HscMapper Policy Templates
4 Processing Notes
  4.1 Processing of the Release Candidate
  4.2 Summary of Outputs
  4.3 Reproducible Failures
  4.4 Low-level Processing Details
5 Resource Usage
  5.1 Disk Usage
  5.2 CPU Usage
  5.3 Node Utilization
6 Triggered JIRA tickets


1 Dataset Information
The input dataset is the HSC Strategic Survey Program (SSP) Public Data Release 1 (PDR1) (Aihara et al., 2017). The PDR1 dataset has been transferred to the LSST GPFS storage /datasets by DM-9683, and the butler repo is available at /datasets/hsc/repo/. It includes 5654 visits in 7 bands: HSC-G, HSC-R, HSC-I, HSC-Z, HSC-Y, NB0816, NB0921. A file including all visit IDs is attached to the confluence page. The official release site is at https://hsc-release.mtk.nao.ac.jp/. The survey has three layers and includes 8 fields:

1. UDEEP: SSP_UDEEP_SXDS, SSP_UDEEP_COSMOS
2. DEEP: SSP_DEEP_ELAIS_N1, SSP_DEEP_DEEP2_3, SSP_DEEP_XMM(S)_LSS, SSP_DEEP_COSMOS
3. WIDE: SSP_WIDE, SSP_AEGIS

The number of visits in each field and band is summarized in Table 1.
Layer  Field Name ("OBJECT")  HSC-G  HSC-R  HSC-I  HSC-Z  HSC-Y  NB0921  NB0816
DEEP   SSP_DEEP_ELAIS_N1         32     24     28     51     24      20       0
DEEP   SSP_DEEP_DEEP2_3          32     31     32     44     32      23      17
DEEP   SSP_DEEP_XMM_LSS          25     27     18     21     25       0       0
DEEP   SSP_DEEP_COSMOS           20     20     40     48     16      18       0
UDEEP  SSP_UDEEP_SXDS            18     18     31     43     46      21      19
UDEEP  SSP_UDEEP_COSMOS          19     19     35     33     55      29       0
WIDE   SSP_AEGIS                  8      5      7      7      7       0       0
WIDE   SSP_WIDE                 913    818    916    991    928       0       0

Table 1: Number of visits in each field and filter
The tract IDs for each field, obtained from https://hsc-release.mtk.nao.ac.jp/doc/index.php/database/, are summarized in Table 2.
Layer  Field Name ("OBJECT")  Tract IDs
DEEP   SSP_DEEP_ELAIS_N1      16984, 16985, 17129, 17130, 17131, 17270, 17271, 17272, 17406, 17407
DEEP   SSP_DEEP_DEEP2_3       9220, 9221, 9462, 9463, 9464, 9465, 9706, 9707, 9708
DEEP   SSP_DEEP_XMM_LSS       8282, 8283, 8284, 8523, 8524, 8525, 8765, 8766, 8767
DEEP   SSP_DEEP_COSMOS        9569, 9570, 9571, 9572 [1], 9812, 9813, 9814, 10054, 10055, 10056
UDEEP  SSP_UDEEP_SXDS         8523, 8524, 8765, 8766
UDEEP  SSP_UDEEP_COSMOS       9570, 9571, 9812, 9813, 9814, 10054, 10055
WIDE   SSP_AEGIS              16821, 16822, 16972, 16973
WIDE   SSP_WIDE               XMM: 8279-8285, 8520-8526, 8762-8768
                              GAMA09H: 9314-9318, 9557-9562, 9800-9805
                              WIDE12H: 9346-9349, 9589-9592
                              GAMA15H: 9370-9375, 9613-9618
                              HECTOMAP: 15830-15833, 16008-16011
                              VVDS: 9450-9456, 9693-9699, 9935-9941

Table 2: The tract IDs in each field

[1] Tract 9572 is listed on the HSC PDR1 website for DEEP_COSMOS, but no data actually overlap it; PDR1 does not have it either.
Plots of tracts and patches can be found on https://hsc-release.mtk.nao.ac.jp/doc/index.php/data/. In S17B, we attempted to process some edge tracts not listed in Table 2, but those data were not removed from the output repositories. Those data can be ignored; see Section 4.2.


2 Hardware
The processing was done using the LSST Verification Cluster. The Verification Cluster consists of 48 Dell C6320 nodes with 24 physical cores (2 sockets, 12 cores per processor) and 128 GB RAM. As such, the system provides a total of 1152 physical cores. lsst-dev01 is a system with 24 physical cores and 256 GB RAM, running the latest CentOS 7.x, that serves as the front end of the Verification Cluster.

The Verification Cluster runs the Simple Linux Utility for Resource Management (SLURM) cluster management and job scheduling system. lsst-dev01 runs the SLURM controller and serves as the login or head node, enabling LSST DM users to submit SLURM jobs to the Verification Cluster.

lsst-dev01 and the Verification Cluster utilize the General Parallel File System (GPFS) to provide shared disk across all of the nodes. The GPFS has spaces for archived datasets and scratch space to support computation and analysis.


3 Software
The LSST Science Pipelines Software Stack is used. A shared software stack on the GPFS file systems, suitable for computation on the Verification Cluster, has been provided and is maintained by the Science Pipelines team; it is available under /software/lsstsw.
3.1 Software Stack Version

The stack version w_2017_17, published on 26-Apr-2017, was used, together with the master branches of meas_mosaic, obs_subaru, and ctrl_pool from 7-May-2017 built with w_2017_17. This is equivalent to the week 17 tag with DM-10315, DM-10449, and DM-10430.
3.2 Pipeline Steps and Configs

Unless otherwise noted, the HSC default config in the stack is used, including the task defaults and obs_subaru's overrides. That implies the PS1 reference catalog ps1_pv3_3pi_20170110 in the LSST format (HTM indexed) is used (/datasets/refcats/htm/ps1_pv3_3pi_20170110/). The calibration dataset is the 20170105 version provided by Paul Price; the calibration repo is located at /datasets/hsc/calib/20170105 (DM-9978). The externally provided bright object masks (butler type "brightObjectMask") of version "Arcturus" (DM-10436) are added to the repo and applied in coaddDriver.assembleCoadd.
The steps are:

1. makeSkyMap.py

2. singleFrameDriver.py
   - Note: ignore ccd=9, which has bad amps; its results are not trustworthy even if processCcd passes.

3. mosaic.py

4. coaddDriver.py
   - Note: make config.assembleCoadd.subregionSize small enough that a full stack of images can fit into memory at once. This is a trade-off between memory and I/O but does not matter scientifically, as the pixels are independent. (A config sketch is given after this list.)

5. multiBandDriver.py

6. forcedPhotCcd.py
   - Note: it was added late and hence was not run in the RC processing.
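To make the note on coaddDriver.py concrete, a config override file along the lines below could be passed via --configfile; this is a minimal sketch, and the value shown is illustrative rather than the exact setting used in S17B.

    # Hypothetical coaddDriver.py config override file (e.g. coaddConfig.py),
    # passed on the command line as: --configfile coaddConfig.py
    # A smaller subregion lets a full stack of warps for one subregion fit in
    # memory at once; the pixels are independent, so the science is unchanged.
    config.assembleCoadd.subregionSize = (10000, 200)  # illustrative value only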
Operational configurations different from the tagged stack, such as the logging configurations in ctrl_pool, may be used (e.g. DM-10430).
In the full PDR1 reprocessing, everything was run with the same stack version and config. Reproducible failures are noted in Section 4.3, but no reprocessing was done with a newer software version.

This stack version had a known science problem of bad ellipticity residuals, as reported in DM-10482; the bug fix DM-10688 was merged to the stack on 30-May-2017 and hence was not applied in the S17B reprocessing campaign.
3.3 Units of Independent Execution

The pipelines are run in units no smaller than those noted below:

1. makeSkyMap.py: one SkyMap for everything
2. singleFrameDriver.py: ccd (typically run per visit)
3. mosaic.py: tract x filter, including all visits overlapping that tract in that filter
4. coaddDriver.py: patch x filter, including all visits overlapping that patch in that filter (typically run per tract)
5. multiBandDriver.py: patch, including all filters (typically run per tract)
6. forcedPhotCcd.py: ccd

Data of different layers (DEEP/UDEEP/WIDE) are processed separately.
3.4 Example Commands for Processing

1. makeSkyMap.py

   makeSkyMap.py /datasets/hsc/repo --rerun private/username/path

2. singleFrameDriver.py

   singleFrameDriver.py /datasets/hsc/repo --rerun private/username/path
     --batch-type slurm --mpiexec='-bind-to socket' --cores 24 --time 600
     --job jobName2 --id ccd=0..8^10..103 visit=444

3. mosaic.py

   mosaic.py /datasets/hsc/repo --rerun path1:path2 --numCoresForRead=12
     --id ccd=0..8^10..103 visit=444^446^454^456 tract=9856
     --diagnostics --diagDir=/path/to/mosaic/diag/dir/

4. coaddDriver.py

   coaddDriver.py /datasets/hsc/repo --rerun path2 --batch-type=slurm
     --mpiexec='-bind-to socket' --job jobName4 --time 600 --nodes 1 --procs 12
     --id tract=9856 filter=HSC-Y --selectId ccd=0..8^10..103 visit=444^446^454^456

5. multiBandDriver.py

   multiBandDriver.py /datasets/hsc/repo --rerun path2 --batch-type=slurm
     --mpiexec='-bind-to socket' --job jobName5 --time 5000 --nodes 1 --procs 12
     --id tract=9856 filter=HSC-Y^HSC-I

6. forcedPhotCcd.py

   forcedPhotCcd.py /datasets/hsc/repo --rerun path2 -j 12
     --id ccd=0..8^10..103 visit=444
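In the --id and --selectId arguments above (and in the visit lists in Section 4.1), ^ separates multiple values and start..end:stride denotes an inclusive range. The short stand-alone Python function below is not part of the stack; it only illustrates how such expressions expand.

    # Illustrative helper, not an LSST stack API: expand a visit/ccd expression
    # such as "1228..1232:2^1236..1248:2" into the list of integers it denotes.
    def expand_ids(expr):
        ids = []
        for part in expr.split("^"):
            if ".." in part:
                bounds, _, stride = part.partition(":")
                start, end = (int(v) for v in bounds.split(".."))
                ids.extend(range(start, end + 1, int(stride) if stride else 1))
            else:
                ids.append(int(part))
        return ids

    print(expand_ids("1228..1232:2^1236..1248:2"))
    # [1228, 1230, 1232, 1236, 1238, 1240, 1242, 1244, 1246, 1248]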
3.5 Butler HscMapper Policy Templates

The chosen software stack version (Section 3.1) implies the following templates in the Butler repositories.

calexp: %(pointing)05d/%(filter)s/corr/CORR-%(visit)07d-%(ccd)03d.fits
calexpBackground: %(pointing)05d/%(filter)s/corr/BKGD-%(visit)07d-%(ccd)03d.fits
icSrc: %(pointing)05d/%(filter)s/output/ICSRC-%(visit)07d-%(ccd)03d.fits
src: %(pointing)05d/%(filter)s/output/SRC-%(visit)07d-%(ccd)03d.fits
srcMatch: %(pointing)05d/%(filter)s/output/SRCMATCH-%(visit)07d-%(ccd)03d.fits
srcMatchFull: %(pointing)05d/%(filter)s/output/SRCMATCHFULL-%(visit)07d-%(ccd)03d.fits
ossImage: %(pointing)05d/%(filter)s/thumbs/oss-%(visit)07d-%(ccd)03d.fits
flattenedImage: %(pointing)05d/%(filter)s/thumbs/flattened-%(visit)07d-%(ccd)03d.fits
wcs: jointcal-results/%(tract)04d/wcs-%(visit)07d-%(ccd)03d.fits
fcr: jointcal-results/%(tract)04d/fcr-%(visit)07d-%(ccd)03d.fits
brightObjectMask:
deepCoadd/BrightObjectMasks/%(tract)d/BrightObjectMask-%(tract)d-%(patch)s-%(filter)s.reg
(externally provided)
deepCoadd_tempExp:
deepCoadd/%(filter)s/%(tract)d/%(patch)s/warp-%(filter)s-%(tract)d-%(patch)s-%(visit)d.fits
deepCoadd_calexp:
deepCoadd-results/%(filter)s/%(tract)d/%(patch)s/calexp-%(filter)s-%(tract)d-%(patch)s.fits
deepCoadd_det:
deepCoadd-results/%(filter)s/%(tract)d/%(patch)s/det-%(filter)s-%(tract)d-%(patch)s.fits
deepCoadd_calexp_background:
deepCoadd-results/%(filter)s/%(tract)d/%(patch)s/det_bkgd-%(filter)s-%(tract)d-%(patch)s.fits
deepCoadd_meas:
deepCoadd-results/%(filter)s/%(tract)d/%(patch)s/meas-%(filter)s-%(tract)d-%(patch)s.fits
deepCoadd_measMatch:
deepCoadd-results/%(filter)s/%(tract)d/%(patch)s/srcMatch-%(filter)s-%(tract)d-%(patch)s.fits
deepCoadd_measMatchFull:
deepCoadd-results/%(filter)s/%(tract)d/%(patch)s/srcMatchFull-%(filter)s-%(tract)d-%(patch)s.fits
deepCoadd_mergeDet:
deepCoadd-results/merged/%(tract)d/%(patch)s/mergeDet-%(tract)d-%(patch)s.fits
deepCoadd_ref: deepCoadd-results/merged/%(tract)d/%(patch)s/ref-%(tract)d-%(patch)s.fits
deepCoadd_forced_src:
deepCoadd-results/%(filter)s/%(tract)d/%(patch)s/forced_src-%(filter)s-%(tract)d-%(patch)s.fits
forced_src: %(pointing)05d/%(filter)s/tract%(tract)d/FORCEDSRC-%(visit)07d-%(ccd)03d.fits
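These templates are Python %-style mapping strings that the butler fills in with data ID values. A minimal illustration, using a made-up data ID, is:

    # Illustrative only: expanding the calexp template with a made-up data ID.
    template = "%(pointing)05d/%(filter)s/corr/CORR-%(visit)07d-%(ccd)03d.fits"
    dataId = {"pointing": 817, "filter": "HSC-I", "visit": 1228, "ccd": 50}
    print(template % dataId)  # 00817/HSC-I/corr/CORR-0001228-050.fits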


4 Processing Notes

4.1 Processing of the Release Candidate

A Release Candidate ("RC") dataset was defined and used in the test processing before the full processing started.
The RC dataset was originally defined in https://hsc-jira.astro.princeton.edu/jira/browse/HSC-1361 for hscPipe 3.9.0. The RC dataset is public and available at /datasets/. 62 of its visits were not included in PDR1: two of SSP_WIDE and 60 of SSP_UDEEP_COSMOS. Those data were obtained by DM-10128 and their visit IDs are: 274 276 278 280 282 284 286 288 290 292 294 296 298 300 302 306 308 310 312 314 316 320 334 342 364 366 368 370 1236 1858 1860 1862 1878 9864 9890 11742 28354 28356 28358 28360 28362 28364 28366 28368 28370 28372 28374 28376 28378 28380 28382 28384 28386 28388 28390 28392 28394 28396 28398 28400 28402 29352.

The RC dataset includes (1) 237 visits of SSP_UDEEP_COSMOS and (2) 83 visits of SSP_WIDE, in 6 bands. Their data IDs are listed below.
1. Cosmos to full depth (part of SSP_UDEEP_COSMOS):

   (a) HSC-G
       11690..11712:2^29324^29326^29336^29340^29350^29352
   (b) HSC-R
       1202..1220:2^23692^23694^23704^23706^23716^23718
   (c) HSC-I
       1228..1232:2^1236..1248:2^19658^19660^19662^19680^19682^19684^19694^19696
       ^19698^19708^19710^19712^30482..30504:2
   (d) HSC-Y
       274..302:2^306..334:2^342..370:2^1858..1862:2^1868..1882:2^11718..11742:2
       ^22602..22608:2^22626..22632:2^22642..22648:2^22658..22664:2
   (e) HSC-Z
       1166..1194:2^17900..17908:2^17926..17934:2^17944..17952:2^17962^28354..28402:2
   (f) NB0921
       23038..23056:2^23594..23606:2^24298..24310:2^25810..25816:2

2. Two tracts of WIDE (part of SSP_WIDE):

   (a) HSC-G
       9852^9856^9860^9864^9868^9870^9888^9890^9898^9900^9904^9906^9912^11568^11572
       ^11576^11582^11588^11590^11596^11598
   (b) HSC-R
       11442^11446^11450^11470^11476^11478^11506^11508^11532^11534
   (c) HSC-I
       7300^7304^7308^7318^7322^7338^7340^7344^7348^7358^7360^7374^7384^7386^19468
       ^19470^19482^19484^19486
   (d) HSC-Y
       6478^6482^6486^6496^6498^6522^6524^6528^6532^6544^6546^6568^13152^13154
   (e) HSC-Z
       9708^9712^9716^9724^9726^9730^9732^9736^9740^9750^9752^9764^9772^9774^17738
       ^17740^17750^17752^17754
The w_2017_17 stack and meas_mosaic ecfbc9d built with w_2017_17 were used in the RC reprocessing (DM-10129). In singleFrameDriver, there were reproducible failures in 46 ccds from 23 visits. The failed visit/ccds are the same as those in the w_2017_14 stack (DM-10084). Their data IDs are:

--id visit=278 ccd=95 --id visit=280 ccd=22^69 --id visit=284 ccd=61
--id visit=1206 ccd=77 --id visit=6478 ccd=99 --id visit=6528 ccd=24^67
--id visit=7344 ccd=67 --id visit=9736 ccd=67 --id visit=9868 ccd=76
--id visit=17738 ccd=69 --id visit=17750 ccd=58 --id visit=19468 ccd=69
--id visit=24308 ccd=29 --id visit=28376 ccd=69 --id visit=28380 ccd=0
--id visit=28382 ccd=101 --id visit=28392 ccd=102 --id visit=28394 ccd=93
--id visit=28396 ccd=102 --id visit=28398 ccd=95^101
--id visit=28400 ccd=5^10^15^23^26^40^53^55^61^68^77^84^89^92^93^94^95^99^100^101^102
--id visit=29324 ccd=99 --id visit=29326 ccd=47
In WIDE, the coadd products have all 81 patches in both tracts (8766, 8767) in 5 filters, except that there is no coadd in tract 8767 patch 1,8 in HSC-R (because nothing passed the PSF quality selection there); the multiband products of all 162 patches are generated.

In COSMOS, the coadd products have 77 patches in tract 9813 in HSC-G, 74 in HSC-R, 79 in HSC-I, 79 in HSC-Y, 79 in HSC-Z, and 76 in NB0921; the multiband products of 79 patches are generated.

brightObjectMasks were not applied in processing the RC dataset, but this should not affect the results. forcedPhotCcd.py was not run in the RC processing.
4.2 Summary of Outputs

All processing was done with the same stack setup. Data of the three layers (UDEEP, DEEP, WIDE) were processed separately. The output repositories are archived at:

/datasets/hsc/repo/rerun/DM-10404/UDEEP/
/datasets/hsc/repo/rerun/DM-10404/DEEP/
/datasets/hsc/repo/rerun/DM-10404/WIDE/

All logs are at /datasets/hsc/repo/rerun/DM-10404/logs/.

While unnecessary, some edge tracts outside the PDR1 coverage were attempted in the processing. Those data outputs are kept in the repos as well. In other words, there are more tracts in the above output repositories than the tract IDs listed in Table 2. The additional data can be ignored.

4.3 Reproducible Failures

In singleFrameDriver/processCcd, there were reproducible failures in 78 CCDs from 74 visits. Their data IDs are:
--id visit=1206 ccd=77 --id visit=6342 ccd=11 --id visit=6478 ccd=99 --id visit=6528 ccd=24
--id visit=6528 ccd=67 --id visit=6542 ccd=96 --id visit=7344 ccd=67 --id visit=7356 ccd=96
--id visit=7372 ccd=29 --id visit=9736 ccd=67 --id visit=9748 ccd=96 --id visit=9838 ccd=101
--id visit=9868 ccd=76 --id visit=11414 ccd=66 --id visit=13166 ccd=20
--id visit=13178 ccd=91 --id visit=13198 ccd=84 --id visit=13288 ccd=84
--id visit=15096 ccd=47 --id visit=15096 ccd=54 --id visit=15206 ccd=100
--id visit=16064 ccd=101 --id visit=17670 ccd=24 --id visit=17672 ccd=24
--id visit=17692 ccd=8 --id visit=17736 ccd=63 --id visit=17738 ccd=69
--id visit=17750 ccd=58 --id visit=19468 ccd=69 --id visit=23680 ccd=77
--id visit=23798 ccd=76 --id visit=24308 ccd=29 --id visit=25894 ccd=68
--id visit=29324 ccd=99 --id visit=29326 ccd=47 --id visit=29936 ccd=66
--id visit=29942 ccd=96 --id visit=29966 ccd=103 --id visit=30004 ccd=95
--id visit=30704 ccd=101 --id visit=32506 ccd=8 --id visit=33862 ccd=8
--id visit=33890 ccd=61 --id visit=33934 ccd=95 --id visit=33964 ccd=101
--id visit=34332 ccd=61 --id visit=34334 ccd=61 --id visit=34412 ccd=78
--id visit=34634 ccd=61 --id visit=34636 ccd=61 --id visit=34928 ccd=61
--id visit=34930 ccd=61 --id visit=34934 ccd=101 --id visit=34936 ccd=50
--id visit=34938 ccd=95 --id visit=35852 ccd=8 --id visit=35862 ccd=61
--id visit=35916 ccd=50 --id visit=35932 ccd=95 --id visit=36640 ccd=68
--id visit=37342 ccd=78 --id visit=37538 ccd=100 --id visit=37590 ccd=85
--id visit=37988 ccd=33 --id visit=38316 ccd=11 --id visit=38328 ccd=91
--id visit=38494 ccd=6 --id visit=38494 ccd=54 --id visit=42454 ccd=24
--id visit=42510 ccd=77 --id visit=42546 ccd=93 --id visit=44060 ccd=31
--id visit=44090 ccd=27 --id visit=44090 ccd=103 --id visit=44094 ccd=101
--id visit=44162 ccd=61 --id visit=46892 ccd=64 --id visit=47004 ccd=101
Out of the 78 failures:
1. 36 failed with: "Unable to match sources"
2. 13 failed with: "No objects passed our cuts for consideration as psf stars"
3. 7 failed with: "No sources remaining in match list after magnitude limit cuts"
4. 3 failed with: "No input matches"
5. 3 failed with: "Unable to measure aperture correction for required algorithm 'modelfit_CModel_exp': only 1 sources, but require at least 2."
6. 1 failed with: "All matches rejected in iteration 2"
7. 15 failed with: "PSF star selector found [123] candidates"
In multiBandDriver, two patches of WIDE (tract=9934 patch=0,0 and tract=9938 patch=0,0) failed with an AssertionError, as reported in DM-10574. We excluded the failed patches from the multiBandDriver commands, and then the jobs were able to complete and process all other patches. DM-10574 has since been fixed.

The multiBandDriver job of WIDE tract=9457 could not finish unless patch=1,8 was excluded. However, tract 9457 is actually outside of the PDR1 coverage. In forcedPhotCcd, fatal errors were seen because the reference of a patch did not exist; therefore some forced_src were not generated. A JIRA ticket DM-10755 has been filed but is not fixed as of Aug 24, 2017.
4.4 Low-level Processing Details

This section includes low-level details that may only be of interest to the Operations team.

The first singleFrame job started on May 8, the last multiband job was on May 22, and the last forcedPhotCcd job was on Jun 1. The processing was done using the Verification Cluster and the GPFS space mounted on it. The NCSA team was responsible for shepherding the run and resolving non-pipeline issues, with close communication with and support from the DRP team regarding the science pipelines. The ctrl_pool style drivers were run on the slurm cluster.
The processing tasks/drivers were run as a total of 8792 slurm jobs:
1. 514 singleFrame slurm jobs
2. 1555 mosaic slurm jobs
3. 1555 coadd slurm jobs
4. 362 multiband slurm jobs
5. 4806 forcedPhotCcd slurm jobs
Their slurm job IDs can be found on the confluence page.
For single frame processing, every 11 visits (an arbitrary choice to divide the work into a manageable number of jobs) were grouped into one singleFrameDriver.py command, giving 5654/11 = 514 jobs in total, and each job was submitted to one worker node. Data of the three layers (DEEP, UDEEP, WIDE) were handled separately beginning with the mosaic pipeline step. skymap.findTractPatchList was used to check through each calexp, find out what tract/patch the ccd overlaps, and write the result into sqlite3 (a short sketch of this step follows this paragraph). There are 1555 tract x filter combinations for all three layers. For each tract x filter, all overlapping visits and a template were used to make a slurm job file (similar to the .sl file in https://developer.lsst.io/services/verification.html#verification-slurm). Similarly for coadd making, each tract x filter was a slurm job, but the jobs were submitted using coaddDriver.py. The multiband processing jobs were submitted for each tract, using multiBandDriver.py. All numbers here include tracts that were not actually necessary (outside the PDR1 coverage). For forcedPhotCcd, the CmdLineTask commands were written into slurm job files for submission, similar to running mosaic. 21 visits (an arbitrary choice) were grouped in each slurm job in the first batch of submissions; the rest had one visit in each slurm job. In this campaign, at most 24 cores were used on one node at a time, and sometimes even fewer. We were aware that the jobs were not run in the optimal way; this is to be improved in the coming cycles.
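A minimal sketch of that bookkeeping step is below. It assumes a 2017-era stack (lsst.daf.persistence Butler, lsst.afw.geom), and the rerun path, data ID list, and table schema are hypothetical; the production script differed in its details.

    # Sketch only: find which tracts/patches each calexp overlaps, using
    # skymap.findTractPatchList, and record them in sqlite3.
    import sqlite3
    import lsst.afw.geom as afwGeom
    from lsst.daf.persistence import Butler

    butler = Butler("/datasets/hsc/repo/rerun/private/username/sfm")  # hypothetical rerun
    skymap = butler.get("deepCoadd_skyMap")

    conn = sqlite3.connect("overlaps.sqlite3")
    conn.execute("CREATE TABLE IF NOT EXISTS overlaps "
                 "(visit INT, ccd INT, tract INT, patch TEXT)")

    for dataId in [dict(visit=444, ccd=50)]:  # in practice, loop over all calexps
        calexp = butler.get("calexp", dataId)
        wcs, bbox = calexp.getWcs(), calexp.getBBox()
        corners = [wcs.pixelToSky(afwGeom.Point2D(c)) for c in bbox.getCorners()]
        for tractInfo, patchList in skymap.findTractPatchList(corners):
            for patchInfo in patchList:
                conn.execute("INSERT INTO overlaps VALUES (?, ?, ?, ?)",
                             (dataId["visit"], dataId["ccd"], tractInfo.getId(),
                              "%d,%d" % patchInfo.getIndex()))
    conn.commit()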
In general, when jobs failed, little effort was spent on investigation as long as the reruns were successful. There were a few transient hardware/filesystem issues. For example, a known GPFS hiccup once failed two jobs; we became aware of it because admins flagged an issue and we happened to match up the timings within a few minutes. But issues like that could easily happen without being noticed. Other examples of non-science-pipeline failures are given below.
Failures like the "black hole node" phenomenon were seen a few times. Sometimes many jobs were queued in slurm, and the next morning all jobs with IDs larger than a certain job ID were found to have failed without any log being written. The apparent cause is that Slurm scheduled numerous jobs in succession, one after another, to a faulty node with a GPFS problem, resulting in a set of failed jobs. Jobs that started running before that failure point were able to continue as normal. Resubmissions of the same failed jobs also succeeded. The observation of a succession of jobs all going to the same problematic node and failing over and over again in a short amount of time motivates an examination of the controller configuration, as there may be Slurm settings that would distribute jobs and avoid this scenario.
There was an instance that seemed to be a butler repo race condition. When running the mosaic processing, multiple jobs seemed to be doing I/O with repositoryCfg.yaml, and jobs failed at File python/lsst/daf/persistence/posixStorage.py, line 189, in putRepositoryCfg and then python/lsst/daf/persistence/safeFileIo.py, line 84, in FileForWriteOnceCompareSame. Multiple files like "repositoryCfg.yamlGXfgIy" were left in the repo, and they were all identical. There are two possible ways to avoid this: (1) always do a pre-run, or (2) do not let jobs write into the same output repo or share disk.

Although large time limits were deliberately used in the slurm jobs, several jobs timed out and were cancelled by slurm, mostly multiband jobs. For new runs, we chose to start over with a new output repo rather than letting the driver reuse the existing data. Manual butler repo manipulation was needed to clean up bad executions or combine results.

The pipe_drivers submission could take a few minutes to start each job.
For S17B, the job outputs were written to a production scratch space. The rerun repos were cleaned up and failures were resolved there. Then the repos were transferred to the archived space at /datasets. For transferring, a script doing parallel syncing on a worker node was used; an example is on https://wiki.ncsa.illinois.edu/display/~wglick/2013/11/01/Parallel+Rsync.
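As an illustration of the idea (not the actual script, which followed the wiki page above), parallel syncing can be done by launching one rsync per top-level subdirectory; the source and destination paths below are placeholders.

    # Sketch only: parallel rsync of an output repo, one rsync per subdirectory.
    import os
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    SRC = "/scratch/s17b_rerun/DM-10404"       # placeholder source path
    DST = "/datasets/hsc/repo/rerun/DM-10404"  # placeholder destination path

    def sync(subdir):
        subprocess.run(["rsync", "-a",
                        os.path.join(SRC, subdir) + "/",
                        os.path.join(DST, subdir) + "/"],
                       check=True)

    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(sync, sorted(os.listdir(SRC))))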


5 Resource Usage
5.1 Disk Usage

Figure 1 shows the disk usage in the production scratch space, which was reserved purely for this S17B campaign's use. Tests and failed runs wrote to this space as well. At around hour 275, removal of some older data in this scratch space was performed, so the drop there should be ignored.

FIGURE 1: Disk usage of the production scratch space throughout the S17B reprocessing

The resultant data products included 11594219 files in total. The large files were typically hundreds of MB; the average size was around 14 MB. The file size distribution is shown in Figure 3. The data products were archived in 4 folders in the GPFS space /datasets/hsc/repo/rerun/DM-10404/. The number of files in each folder is shown in Figure 2. Below we summarize the output repositories in each folder.
1. SFM
   Overall there were 4664198 files. The smallest data products were calexpBackground fits files, which took up 128 K each. The largest data products were calexp fits files, which took up 82 M each. The average file size was 11.7 M. The number of files above 1 M was 1358136 and the number of files with size less than 1 M was 3307875.

2. DEEP
   There were 829604 files in total. The 2000 largest files ranged in size from 479 M to 177 M, all of which existed within ./deepCoadd-results and were fits files. While the vast majority were deepCoadd_meas fits files associated with a filter folder within deepCoadd-results, some were deepCoadd_ref fits files associated with the ./deepCoadd-results/merged/ folder. The 2000 smallest files were all boost files of size 0 B, all of which were contained within the folder path 00991/HSC-Y/tract{number}/forcedPhotCcd-metadata/. Although all of the files considered here were in the HSC-Y filter, there existed many other similar files within the other filters as well. The files had an average size of 15.5 M. There were 196034 files above 1 M and 705548 files below.
3. UDEEP
   There were 411557 files in total. The 2000 largest files here were similar to those in the DEEP folder described above; most of the large files were deepCoadd_meas fits or deepCoadd_calexp fits files located within the filter folders of ./deepCoadd-results/, while a few of them were deepCoadd_ref fits files located in the folder ./deepCoadd-results/merged/. The files ranged in size from 446 M to 172 M. The 2000 smallest files had a situation similar to that described in the DEEP folder above: all of the smallest files were of size 0 B, were boost files, and were all located in ./00814/HSC-Y/tract{number}/forcedPhotCcd-metadata or the corresponding paths for the other filters. The files had an average size of 16.7 M. There were 95071 files above 1 M and 316486 files below.

4. WIDE
   There were 5688860 files in total. This folder by far had the largest files. Of the 2000 largest files, the largest was 1.1 G and the smallest was 209 M. The WIDE directory contained the only file over 1 GB. The files here were like the largest found in the DEEP folder: most of them were deepCoadd_meas fits files located in the filter folders of ./deepCoadd-results/ while a few of them were deepCoadd_ref fits files located within ./deepCoadd-results/merged/. The 2000 smallest files were exactly like those found in the UDEEP folder above. The files had an average size of 15.7 M. There were 1417818 files above 1 M and 4271042 files below.
FIGURE 2: The number of files in each output folder. Note that the y-axis here is on a log scale.
FIGURE 3: File size distribution of the output repos. Note that the y-axis here is on a log scale.

Figure 4 shows the distributions for the data products in terms of butler dataset types. All plots are in log scale.
5.2 CPU Usage

The total CPU used was 79246 core-hours, that is, around 471.7 core-weeks. The total User CPU was 76246 core-hours, that is, around 453.8 core-weeks.

The core-hours spent at each pipeline step were:

1. sfm: 19596.9
2. mosaic: 943.2
3. coadd: 5444.9
4. multiband: 34127.2
5. forcedPhotCcd: 19133.9
FIGURE 4: Size distribution across butler types
The percentage is shown in Figure 5.
FIGURE 5: CPU time for each pipeline
Figure 6 shows the "efficiency" for each pipeline, calculated by dividing the total CPU time by the wall elapsed time multiplied by the number of cores.
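Written out, with T_CPU the accumulated CPU time of a job, T_wall its elapsed wall-clock time, and N_cores the number of cores allocated to it:

    efficiency = T_CPU / (T_wall * N_cores)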
1. A general feature of the plots is that the efficiency is bounded by the fact that, with ctrl_pool/MPI, the MPI root process was mostly idle and occupied one core. This corresponds to an upper bound of 23/24 ≈ 0.958 for SFM, 11/12 ≈ 0.916 for coadd processing, etc.
2. sfm: Every 11 visits were grouped into one job, and each visit had 103 ccds. Thus, 1133 ccds were processed in a job, divided amongst 24 cores. Each ccd took around 2 minutes on average; in other words, roughly 90 min of wall clock elapsed time and 36 hr of accumulated CPU time per job. Efficiency was uniformly good. SingleFrameDriverTask is a ctrl_pool BatchParallelTask. The histogram in Figure 7 shows the CPU time of the SFM slurm jobs. The job IDs of the longest running jobs are: 51245, 51320, 51371, 51483, 51496, 51497, 51525, 51533, 51534, 51536, 51546, 51547, 51548, 51549, 51550, 51582, 51587, 51602, 51603.
3. mosaic: The unit of processing is each tract x filter on a node for each layer. Mosaic jobs used 12 cores for reading source catalogs, via Python multiprocessing, but 1 core for other parts of the task; therefore we did not calculate the efficiency, as it would be misleading. MosaicTask did not use ctrl_pool.

FIGURE 6: Efficiency for each pipeline

FIGURE 7: CPU hours of the SFM pipeline
4. coadd: coadd jobs were chosen to process a tract on a node. One tract has 9*9 = 81 patches. CoaddDriverTask is a ctrl_pool BatchPoolTask. In most cases the patches were processed "12 wide" using ctrl_pool, distributing the work to 12 cores on a node. Using MPI-based ctrl_pool in this context led to one mostly idle MPI root process and 11 workers. As Verification nodes have 128 GB RAM, this gives on average about 11 GB of memory per patch, with the aggregate able to use the 128 GB.

5. MultiBandDriver is a ctrl_pool BatchPoolTask. Six multiband jobs (9476-mbWIDE9219, 59482-mbWIDE9737, 59484-mbWIDE10050, 59485-mbWIDE10188, 59486-mbWIDE16003, 59316-mbUDEEP8522) were excluded from this figure; their elapsed times were very short and their efficiencies very bad, but they were from tracts outside of the survey coverage.

6. Some of the forcedPhotCcd jobs, run as only one task on one node, had very high efficiency but this gave bad throughput.

7. Figure 8 shows the histograms of the maximum resident set size and the virtual memory size for mosaic and forcedPhotCcd. Memory monitoring of ctrl_pool driver jobs (singleFrameDriver, coaddDriver, multiBandDriver) was problematic and we do not trust the numbers collected, so we do not plot them.
FIGURE 8: Memory usage of mosaic and forcedPhotCcd pipelines
5.3 Node Utilization

Figure 9 shows the node utilization throughout the campaign. The Verification Cluster in its optimal state has 48 compute nodes, each with 24 physical cores and 128 GB RAM. For the duration of the S17B reprocessing there was a peak of 45 compute nodes available. The plot does not include failed jobs or test attempts, whose generated data do not contribute to the final results directly.

FIGURE 9: Hourly node usage


6 Triggered JIRA tickets

Some of the stack issues identified during the S17B Reprocessing have been turned into actionable JIRA tickets. Tickets for Science Pipelines improvements are DM-10574 (Hit AssertionError in deblender), DM-10755 (forcedPhotCcd.py fails with the non-existing reference of a barely-overlapping patch), DM-10782 (Add bright star masks to ci_hsc), and DM-10413 (Please complain louder if brightObjectMask cannot be found). Tickets for Middleware improvements are DM-10761 (Failed CmdLineTask does not give nonzero exit code; resolved by ), DM-10624 (Duplicate log files from running pipe_drivers tasks), and DM-11171 (Please separate algorithmic configs and operational configs in the task framework).

Many of the low-level processing issues are related to either the task framework design or the Data Butler. Issues related to the framework design were not turned into JIRA tickets, but concerns were brought to the SuperTask Working Group and Butler Working Group.


References

Aihara, H., Armstrong, R., Bickerton, S., et al., 2017, ArXiv e-prints (arXiv:1702.08449), ADS Link