Large Synoptic Survey Telescope
Early (pre-2013) Large-Scale
Jacek Becla, K-T Lim, Daniel Wang
DMTR-21
Latest Revision: 2013-08-02
Abstract
This test report contains Qserv large-scale tests that
search and development phase. The format is non-standard
tests of different aspects of Qserv performed over a number
were performed in early 2013 or earlier. Many of these
mented on the LSST Trac, and versions of LDM-135 from 2013 and earlier.
LARGE SYNOPTIC SURVEY TELESCOPE
Early Qserv Tests DMTR-21 Latest Revision 2013-08-02
Contents
1 Introduction
1.1 Ideal environment
1.2 Schedule of testing
1.3 Current status of tests
2 Phase II Qserv Testing (up to 100 nodes)
2.1 ir2farm testing
2.1.1 Hardware
2.1.2 Initial testing
2.1.3 Further 38+1 node testing with r18694
2.1.4 38+1 using r19068
2.2 Scalability testing
2.2.1 Using ir2farm 38+1 configuration
2.3 boer testing
2.3.1 Introduction
2.3.2 Hardware
2.3.3 Software
2.3.4 Configuration
2.3.5 Initial Results
2.4 Notes
3 150-node Scalability Test
3.1 Hardware
3.2 Data
3.3 Queries
3.3.1 Low-volume 1 – object retrieval
3.3.2 Low-volume 2 – time series
3.3.3 Low-volume 3 – spatially-restricted filter
3.3.4 High volume 1 – count
3.3.5 High-volume 2 – full-sky filter
3.3.6 High-volume 3 – density
3.3.7 Super-high-volume 1 – near neighbor
3.3.8 Super-high-volume 2 – sources not near objects
3.4 Scaling
3.4.1 Scaling with small queries
3.4.2 Scaling with expensive queries
3.5 Concurrency
3.6 Discussion
3.6.1 Latency
3.6.2 Solid-state storage
3.6.3 Many core
3.6.4 Alternate partitioning
3.6.5 Distributed management
4 100-TB Scalability Test (JHU 20-node cluster)
5 Concurrency Tests (SLAC 100,000 chunk-queries)
Early (pre-2013) Large-Scale Qserv
1 Introduction
1.1 Ideal environment
Based on the detailed spreadsheet analysis, we expect the
will be composed of few hundred database servers [LDM-144], so a realistic test should
clude a cluster of at least 100 nodes.
Total database size of a single data release will vary from
1
Realistic testing requires at least ~20-30 TB of storage
Note that a lot of highly focused tests which are extremely useful
of the system can be done on a very small, 2-3 cluster, or
example of that can be measuring the effect of table size on
join: this type of join will be done per sub-partition, and
rows), thus almost all tests involving a single sub-partition
with very little disk storage.
A significant amount of testing should be done where the dataset
memory size by an order of magnitude. This testing is important
mance in the presence of disk performance characteristics.
It is essential to have at least two different types of catalogs:
data needs to be correlated, that is, the objects should
these 2 tables will allow us to measure speed of joins. It
of source-like tables (DIASource, ForcedSource) – the tests done with Source should be a good
approximation.
The most important characteristic of the test data is its
reflect realistic densities: presence of very crowded or
how data is partitioned and on performance of certain queries
inside one partition). Other than realistic spatial distribution,
1
These numbers are for single copy, data and indices, compressed when
valid (e.g., magnitudes) in order to try some queries with
These tests are not only used to prove our system is capable
but also as a means to stress the system and uncover potential
In practice, whoever runs these tests should well understand
architecture system and tuning MySQL.
1.2 Schedule of testing
•Selecting the base technology – Q2 2009
•Determining the architecture – Q3 2009
•Pre-alpha tests focused on parallel distributed query
scale (~20 nodes) – Q4 2009
•Most major features in (except shared scan, user tables),
cluster (~100 nodes) – Q4 2010
•Scalability and performance improvements and tests on
– Q4 2011
•Large scale tests, performance tests on a large cluster
•Shared scans – Q3 2013
•Fault tolerance – Q4 2013
1.3 Current status of tests
We have run several large scale tests
1. (10/2009) A test with the “pre-alpha” version of our software
using the Tuson cluster at LLNL (160 nodes, each node: two Intel
with 4 GB RAM and 1 local hard disk of 80 GB)
2. (2010) several 100-node tests run at SLAC. These tests
necks and prompted rewriting parts of our software, as
missing features in XRootD.
3. (4/2011) A 30 TB test on 150-node SLAC cluster using Qserv
urations, using 2 billion row Object and 32 billion row
set.
4. (12/2012) A 100-terabyte test on JHU’s 20-node Datascope cluster. Due to high
of the cluster this test turned into testing resilience
administrative tools.
5. (05/2013) A 10,000-chunk test on a 120-node SLAC cluster
subtle issues with concurrency at scale.
6. (07/2013) A test on 300-node IN2P3 cluster to re-test
rency, and uncover unexpected issues and bottlenecks.
7. (08/2013) A demonstration of shared scans.
The tests 2-5 listed above are described in further detail
DMTR-12.
2 Phase II Qserv Testing (up to 100 nodes)
This section describes the earliest testing of Qserv, during
testing the USNO-B data set, which was not representative of the shape
LSST data, although it represented real measurements on
2.1 ir2farm testing
2.1.1 Hardware
We have been granted use (since September 2010) of 40 nodes
ir2farm. Its main purpose is to support BaBar data processing, and our access will
rary.
2.1.1.1 Node configuration
•SUN X4100 server
•CPU: Dual Opteron 275 processors (4 cores total), 2.2GHz,
•Memory: 4GB
•HD: 1 per machine for testing. binaries + data on same disk.
– Disk details: 73.5GB FUJITSU MAY2073RCSUN72G 10k SAS
– Fujitsu MAY2073RC series, 2.5”, 73.5GB, 8MB buffer,
– Seek Time: 4 ms (average) / 8 ms (max)
– Track-to-Track Seek Time 0.2 ms
– Average Latency 2.99 ms
– Spindle Speed 10000 rpm
– Mean time before failure 1,400,000 hour(s)
– Storage Hard Drive / Non-Recoverable Errors 1 per 10^15
– Seek Errors 10 per 10^8
– 1 Gbit ethernet, iperf: 944 Mbits/sec using default
2.1.1.2 OS/software/cluster configuration
• RHEL 5.5 Linux
• 2.6.18 series kernel
• gcc 4.1.2
• Python 2.5.2 (from LSST stack)
• MySQL server 5.1.45 (default tunings from source distrib)
• MySQL client 5.0.45, python 1.2.2 (LSST stack)
• antlr 2.7.7
• Twisted 9.0.0
• boost 1.42.0
• swig 1.3.36+2 (LSST stack)
• tcltk 8.5a4
• scons 2.0.1
• Development version of xrootd/scalla (inc. changes not
• Using 39 of 40 nodes, one had a faulty disk at time of installation,
redistributed data since it got fixed.
• 1 Master, 38 worker.
• Master: xrootd/cmsd in manager mode, qserv-master, MySQL
data db, mysql-proxy
• Worker: xrootd/cmsd in server mode, qserv-worker as
dbs.
2.1.1.3 Data configuration
Using duplicated PT1 data.
• PT1 Object data, 33 copies in RA, 17 copies in declination
#!/bin/sh
fpath=orig
opath=.
tname=Object
# total is seq 0 140
for i in `seq 0 9`; do
    beg=`expr $i \* 4`
    end=`expr $beg + 4`
    echo "beg $beg end $end"
    python duplicatePT1.py --schema $fpath/$tname.sql \
        --delim=, \
        --raCopies=33 --declCopies=17 \
        --start=$beg --end=$end \
        --box=-5.25,-4.4,5.25,4.75 \
        --outPrefix=$opath/$tname \
        $fpath/$tname.txt
done
• Partitioned using fixed partitions. 60 stripes, 18 substripes.
./lsstpy.sh partition.py -v -t 2 -p 4 -S 60 -s 18 -P Object
• 2001 chunks total.
2.1.2 Initial testing
Performed using trunk SVN r17580
• full-sky row count:
SELECT COUNT(*) FROM Object;
– 1m50s
• Column query, full-sky:
SELECT COUNT(*) FROM LSST.Object WHERE gFlux_PS>3000;
– 25 min 3.32 sec
• 1 sq degree count query:
SELECT COUNT(*) FROM LSST.Object WHERE qserv_areaspec_box(6,6,7,7)
ra_PS BETWEEN 6 AND 7 AND decl_PS BETWEEN 6 AND 7;
– 13.42 sec
• Full table scan aggregation:
SELECT AVG(ra_PS) FROM Object;
– 25m2s
– Effective bandwidth:
* 1+38 nodes
* 1.47B rows * 1015 bytes/row = ~1.5 trillion bytes (~1.5 TB)
* Bandwidth: 995 MB/sec, or 26 MB/sec for each disk (raw
• Spatially-restricted selection (smaller):
select count(*) from LSST.Object where qserv_areaspec_box(5,5,6,6);
– (count(*) = 0)
– 25 min 5.09 sec
– Using qserv r18134 with spatial UDF
– with box: 6.5,6.5,7.1,7.1: 11.13 sec, count = 17151,
• Spatially-restricted selection (bigger):
select count(*) from LSST.Object where qserv_areaspec_box(1,1,6,6);
– (count(*) = 590978)
– 25 min 6.52 sec
– Using r18134 with spatial UDF processing
– 1994 chunks touched.
• Spatially-restricted selection (small: 0.25 sq deg):
select count(*) from LSST.Object where qserv_areaspec_box(5.5,5.5,6,6);
– (count(*) = 0)
– 12.52 sec
– Using r18134 with spatial UDF processing
– 2 chunks touched.
• Near-neighbor over 9 sq deg
select count(*) from LSST.Object o1, LSST.Object o2
and qserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,
– (count(*) = 395606096)
– 6m18.32s; 6m7.63s
– 4 chunks touched.
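The effective-bandwidth figures quoted for the full table scan in this list can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original test harness: the function name is hypothetical, and it assumes one data disk per worker node (38 workers) and decimal units (1 MB = 10^6 bytes).

```python
def effective_bandwidth(rows, bytes_per_row, seconds, disks):
    """Return (total_bytes, aggregate_bytes_per_sec, per_disk_bytes_per_sec).

    Hypothetical helper: table size divided by scan time gives the
    aggregate rate; dividing again by the disk count gives the per-disk rate.
    """
    total_bytes = rows * bytes_per_row
    aggregate = total_bytes / seconds
    return total_bytes, aggregate, aggregate / disks


# Inputs taken from the text: 1.47 billion rows, 1015 bytes/row,
# a 25m2s full scan, and 38 worker nodes (assumed one disk each).
scan_seconds = 25 * 60 + 2
total, aggregate, per_disk = effective_bandwidth(1.47e9, 1015, scan_seconds, 38)

print("table size: %.2f TB" % (total / 1e12))
print("aggregate:  %.0f MB/s" % (aggregate / 1e6))
print("per disk:   %.1f MB/s" % (per_disk / 1e6))
```

With these inputs the aggregate rate comes out a little under 1 GB/s and the per-disk rate at about 26 MB/s, in line with the figures reported above; the small difference from 995 MB/s is presumably rounding of the row count.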
2.1.3 Further 38+1 node testing with r18694
• Simple count query
mysql> select count(*) from Object;
Times: 9:56.84, 45.42, 44.61 ~10 minute execution for
the framework incurs overhead in determining chunk-node
them for future access.
• Near-neighbor over 100 sq degrees
mysql> select count(*) from LSST.Object o1, LSST.Object
where qserv_areaspec_box(-11,-11,-1,-1)
and qserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,
+------------+
| count(*)   |
+------------+
| 8261246957 |
+------------+
1 row in set (41 min 54.28 sec)
Note: there are 4,373,646 rows in this box, spread over
• Concurrent near neighbor queries. 2 x 100 sq deg
mysql> select count(*) from LSST.Object o1, LSST.Object
where qserv_areaspec_box(-11.1,-11.1,-1.1,-1.1)
and qserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,o2.decl_PS)
+------------+
| count(*)   |
+------------+
| 8136886173 |
+------------+
1 row in set (51 min 8.50 sec)
mysql> select count(*) from LSST.Object o1, LSST.Object
where qserv_areaspec_box(2,-11,12,-1)
and qserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,o2.decl_PS)
+------------+
| count(*)   |
+------------+
| 7763722919 |
+------------+
1 row in set (36 min 41.26 sec)
• COUNT(*) is running about 30s when things are cached.
2000 active chunks, this is about 66 chunk queries per
lot of logistical steps: query generation, node lookup
(network), open-read results-close (network), write
dumpfile, write dumpfile, read dumpfile into mysql, merge
table. This is in addition to the actual subquery execution.
faster. Might be time for some microbenchmarks.
2.1.4 38+1 using r19068
• COUNT(*) is now running with ~12s latency. Should be
buffering results on frontend (necessary to handle large
2.2 Scalability testing
2.2.1 Using ir2farm 38+1 configuration
2.2.1.1 Test harness at 5 nodes
Tested with only 5 nodes, same data density
mately) as 38 node cfg. Test out sample queries. Actual benchmark
numbers in the where clause to prevent query result caching.
• Stream 1:
mysql> select avg(uFlux_PS), avg(uFlux_PS_Sigma) from
+------------------+---------------------+
| avg(uFlux_PS)    | avg(uFlux_PS_Sigma) |
+------------------+---------------------+
| 3538.14419765824 |    120.888001608003 |
+------------------+---------------------+
1 row in set (29 min 9.21 sec)
mysql> select avg(uFlux_PS), avg(uFlux_PS_Sigma) from
+------------------+---------------------+
| avg(uFlux_PS)    | avg(uFlux_PS_Sigma) |
+------------------+---------------------+
| 2431.19667946996 |    98.6096359211597 |
+------------------+---------------------+
1 row in set (25 min 41.61 sec)
• Stream 2:
mysql> select avg(uFlux_PS), avg(uFlux_PS_Sigma) from
+------------------+---------------------+
| avg(uFlux_PS)    | avg(uFlux_PS_Sigma) |
+------------------+---------------------+
| 3538.14419765824 |    120.888001608003 |
+------------------+---------------------+
1 row in set (18 min 32.53 sec)
mysql> select avg(uFlux_PS), avg(uFlux_PS_Sigma) from
+------------------+---------------------+
| avg(uFlux_PS)    | avg(uFlux_PS_Sigma) |
+------------------+---------------------+
| 4003.46130769641 |    129.511811500079 |
+------------------+---------------------+
1 row in set (29 min 22.81 sec)
• Single query:
mysql> select avg(uFlux_PS), avg(uFlux_PS_Sigma) from
+------------------+---------------------+
| avg(uFlux_PS)    | avg(uFlux_PS_Sigma) |
+------------------+---------------------+
| 3933.44004534094 |    128.243156312967 |
+------------------+---------------------+
1 row in set (14 min 34.65 sec)
The single (non-concurrent) query is faster, as expected.
unless scans are shared, two simultaneous full-scans
as long.
2.3 boer testing
2.3.1 Introduction
During 12/2010, we have obtained temporary access to the
access to boer0021-boer0122 (where boer0042 is a dead node),
2.3.2 Hardware
The boer cluster consists of
• SUN X2200 M2 servers
• CPU: Dual Opteron 2218 processors (4 cores total), 2.613 GHz,
• Memory: 8 GB
• System disk: binaries
– SATA 3.0 Gbps configured as UDMA/133
– Seagate ST32500N
– SEAGATE ST32500NSSUN250G 0732B4ES5H, 3.AZK, max UDMA/133
– 250 GB
• Data disk: mysql tables
– Hitachi HDS7225S Rev: V5DO
– HDS7225SCSUN250G
– 250 GB
Other configuration/settings seem to be the same as for ir2farm.
2.3.3 Software
We have configured the software identically as for ir2farm,
dated qserv software.
2.3.4 Configuration
We have set up the cluster with one “head node” and 100 “worker”
runs: mysqlproxy, mysqld, qserv frontend, xrootd/cmsd (manager),
Each worker node runs: mysqld, xrootd/cmsd (server + libqserv_worker).
2.3.5 Initial Results
using trunk r18510
• Full-sky object count:
select count(*) from Object;
Example result:
+------------+
| count(*)   |
+------------+
| 1546902522 |
+------------+
1 row in set (3 min 18.60 sec)
Times: 3m18.60s, 20.48s, 20.47s, 20.43s
Note the first execution of the query is slow, since it
manager. After the manager restarts, it must rebuild
pings. Each mapping is resolved via a broadcast message
a chunk. Since we are running in manager+supervisor
takes one additional network hop for each lookup resolution.
tional hop is incurred for configurations between 65-4096
• 1.2 x 1.2 degree square selection
mysql> select count(*) from Object WHERE qserv_areaspec_box(0,0,1.2,1.2);
+----------+
| count(*) |
+----------+
|    27508 |
+----------+
1 row in set (1.38 sec)
• 1 sq deg selection
mysql> select count(*) from Object WHERE qserv_areaspec_box(-2,-2,-1,-1);
+----------+
| count(*) |
+----------+
|    47336 |
+----------+
1 row in set (2.84 sec)
• 1/4 sq deg selection
mysql> select count(*) from Object WHERE qserv_areaspec_box(-2,-2,-1.5,-1.5);
+----------+
| count(*) |
+----------+
|    16285 |
+----------+
1 row in set (1.44 sec)
• 0.04 sq deg selection
mysql> select count(*) from Object WHERE qserv_areaspec_box(-2,-2,-1.8,-1.8);
+----------+
| count(*) |
+----------+
|     2754 |
+----------+
1 row in set (1.42 sec)
• column-bounded 0.01 sq deg selection
mysql> select count(*) from Object WHERE qserv_areaspec_box(-2,-2,-1.9,-1.9)
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (1.43 sec)
Note that uFlux_PS is NULL in this patch of the sky for
rFlux_PS for them.
mysql> select max(uFlux_PS) from Object WHERE qserv_areaspec_box(-2,-2,-1.9,-1.9);
+---------------+
| max(uFlux_PS) |
+---------------+
|          NULL |
+---------------+
1 row in set (1.42 sec)
• full-sky average:
mysql> select avg(uFlux_PS) from Object where uFlux_PS
+------------------+
| avg(uFlux_PS)    |
+------------------+
| 3526.44353629347 |
+------------------+
1 row in set (19 min 11.33 sec)
2.3.5.1 Using r18512
Switched to synchronous dispatch to reduce
mysql> select avg(uFlux_PS) from Object where uFlux_PS
+------------------+
| avg(uFlux_PS)    |
+------------------+
| 3526.44353629347 |
+------------------+
1 row in set (14 min 11.17 sec)
2.3.5.2 Using r18530
The frontend code was changed to reduce thread contention.
a query is now serial, and result processing is limited
file-handle and thread crowding. Throughput increases,
fects are more severe since the chunk lookups are now serial
patching).
In the results below, the cold start effects are strongly
However, once the lookups are cached, the resulting execution
mysql> select avg(uFlux_PS) from Object where uFlux_PS
+------------------+
| avg(uFlux_PS)    |
+------------------+
| 3526.44353629347 |
+------------------+
1 row in set (22 min 50.89 sec)
mysql> select avg(uFlux_PS) from Object where uFlux_PS
+------------------+
| avg(uFlux_PS)    |
+------------------+
| 3526.44353629347 |
+------------------+
1 row in set (10 min 35.80 sec)
mysql> select avg(uFlux_PS) from Object where uFlux_PS
+------------------+
| avg(uFlux_PS)    |
+------------------+
| 3526.44353629347 |
+------------------+
1 row in set (10 min 35.47 sec)
We are testing a tool that warms up the lookup cache by pre-requesting
advance of real queries. The bottleneck is a large chunk (28 GB
which takes 150+ seconds after all other chunks have completed.
sible for increasing exec time beyond 8 minutes. Below 7
good balancing. MySQL tuning should improve things further,
shows an underutilization of disk bandwidth.
• Full sky object count
SELECT COUNT(*) FROM Object;
18m10.62s; 9.31s; 19.53s; 20.31s
• Column-restricted full-sky count
SELECT COUNT(*) FROM LSST.Object WHERE gFlux_PS>3000;
10 min 35.09 sec
• Full-sky average aggregation
SELECT AVG(ra_PS) FROM Object;
10m35.05s
• Small 1 sq deg selection
select count(*) from LSST.Object where qserv_areaspec_box(-6,-6,-5,-5);
8.41 sec
• Larger 25 sq deg selection
select count(*) from LSST.Object where qserv_areaspec_box(-5,-5,0,0);
8.1 sec
select count(*) from LSST.Object where qserv_areaspec_box(1,1,6,6);
7.61 sec
• Small near-neighbor selection 9 sq deg
select count(*) from LSST.Object o1, LSST.Object o2
and qserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,o2.decl_PS)
4m32.50s; 4 min 16.51 sec
• Medium near-neighbor selection 100 sq deg (2 concurrent
select count(*) from LSST.Object o1, LSST.Object o2
where qserv_areaspec_box(-11,-11,-1,-1)
and qserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,o2.decl_PS)
select count(*) from LSST.Object o1, LSST.Object o2
where qserv_areaspec_box(2,-11,12,1)
and qserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,o2.decl_PS)
These queries failed to complete due to a misconfiguration
there was not enough time to fix and re-run. They were expected
proportional to their covered area, and to not impede
they should be parallelized among different nodes.
2.4 Notes
• 2010/12/15: We do not (currently) support (properly)
SELECT SUM(A) FROM Foo GROUP BY B;
since the query rewriter does not add B to the requested
Thus the final merged aggregation fails because no B exists
• 2010/11/30: Notice that a spatially-restricted selection
scan. This implies that it takes about 25 minutes to scan
indexes on RA and Decl, but with UDFs performing spatial
been unable to leverage them. This argues that we should
– Approach 1: subChunkId clause injection – subChunkId IS indexed, so we should
add something like “subChunkId IN (1,2,3,5,25,100)”
drastically improve performance, but is more complex.
– Approach 2: Add ra/decl indexes – This might be desired
add ra/decl clauses that MySQL understands so that
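Approach 1 in the notes above amounts to appending an indexed restriction to each chunk query so that MySQL can avoid a full scan. The following is a minimal sketch of that rewrite, not Qserv's actual query rewriter: the helper name is hypothetical, the list of sub-chunk IDs is assumed to come from some spatial lookup that is out of scope here, and the WHERE detection is a naive string test rather than a SQL parse.

```python
def inject_subchunk_clause(query, sub_chunk_ids):
    """Append a 'subChunkId IN (...)' restriction to a chunk query.

    Hypothetical helper for illustration only. It assumes a single
    statement with at most one top-level WHERE clause.
    """
    if not sub_chunk_ids:
        return query
    id_list = ",".join(str(i) for i in sorted(set(sub_chunk_ids)))
    clause = "subChunkId IN (%s)" % id_list
    body = query.rstrip().rstrip(";")
    # Naive check: join with AND if a WHERE clause is already present.
    joiner = " AND " if " where " in body.lower() else " WHERE "
    return body + joiner + clause + ";"


rewritten = inject_subchunk_clause(
    "select count(*) from LSST.Object where qserv_areaspec_box(5,5,6,6);",
    [1, 2, 3, 5, 25, 100],
)
print(rewritten)
```

Because subChunkId is indexed, a clause of this form lets MySQL narrow the rows before the spatial UDF is evaluated; how the ID list is derived from the requested box depends on the partitioning scheme and is not shown.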
3 150-node Scalability Test
3.1 Hardware
We configured a cluster of 150 nodes interconnected via gigabit
quad-core Intel Xeon X5355 processors with 16 GB memory and
disk. Tests were conducted with Qserv SVN r21589, MySQL 5.1.45
patches.
3.2 Data
We tested using a dataset synthesized by spatially replicating
challenge (“PT1.1”, Document-26217). We used two tables: Object and Source.
2 The schema may be browsed online at https://ls.st/8g4
tables are among the largest expected in LSST. Of these two,
be the most frequently used. The Source table will have 50-200X
and its use is primarily confined to time series analyses
Object table.
The PT1.1 dataset covers a spherical patch with right-ascension
declination between -7˚ and 7˚. This patch was treated as
cated over the sky by transforming duplicate rows’ RA and
to maintain spatial distance and density by a non-linear
as a function of declination. This resulted in an Object
Source table of 55 billion rows (30 TB).³
The Source table included only data
and +54˚ in declination. The polar portions were clipped
test cluster. Partitioning was set for 85 stripes each with ?? height of
~2.11˚ for stripes and 0.176˚ for sub-stripes. Each chunk
mately 4.5 deg^2, and each sub-chunk, 0.031 deg^2. This yielded 8,983 chunks. Overlap
to 0.01667˚ (1 arc-minute).
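The partition sizes quoted above follow from dividing the 180˚ declination range into equal-height stripes. The sketch below reproduces that arithmetic under two assumptions that are not stated in the surviving text: a sub-stripe count of 12 per stripe (inferred from the ratio 2.11 / 0.176), and chunks that are approximately square near the equator. The function name is hypothetical.

```python
def stripe_geometry(num_stripes, sub_stripes_per_stripe):
    """Return (stripe_height_deg, sub_stripe_height_deg).

    Assumes stripes of equal height spanning declination -90 to +90.
    """
    stripe_height = 180.0 / num_stripes
    sub_stripe_height = stripe_height / sub_stripes_per_stripe
    return stripe_height, sub_stripe_height


# 85 stripes is from the text; 12 sub-stripes per stripe is an inferred assumption.
stripe_h, sub_h = stripe_geometry(85, 12)

# Approximate areas, treating chunks and sub-chunks as squares near the equator.
chunk_area = stripe_h ** 2
sub_chunk_area = sub_h ** 2

print("stripe height:     %.3f deg" % stripe_h)
print("sub-stripe height: %.3f deg" % sub_h)
print("chunk area:        %.2f deg^2" % chunk_area)
print("sub-chunk area:    %.3f deg^2" % sub_chunk_area)
```

Under these assumptions the heights and areas agree with the quoted ~2.11˚, 0.176˚, ~4.5 deg^2 and 0.031 deg^2 to within rounding. The chunk count of 8,983 depends on how chunk widths vary with declination and on the clipped sky coverage, so it is not derived here.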
3.3 Queries
The current Qserv development focus is on features for scalability.
of test queries that demonstrate performance for both cheap
and expensive queries (hour, day latency). Runs of low volume
queries, while runs of high volume queries and super high
a few or even one query due to their expense. All reported
command-line MySQL client (MySQL).
3.3.1 Low-volume 1 – object retrieval
SELECT * FROM Object WHERE objectId = <objId>
In Figure 1 we can see that performance of this query is roughly
seconds. Each run consisted of 20 queries. The slower performance
each execution took 9 seconds, were probably the result
We attribute the initial 8 second execution time in Run 5 and
3 Source for the duplicator is available at https://github.com/lsst/qserv/blob/master/admin/python/lsst/qserv/admin/dataDuplicator.py
(likely the objectId index) in the cluster.
Figure 1: Low-volume object retrieval.
3.3.2 Low-volume 2 – time series
SELECT taiMidPoint, fluxToAbMag(psfFlux), fluxToA
FROM Source
WHERE objectId = <objId>
This query retrieves information from all detections of
tively providing a time-series of measurements on a desired
was randomized as for the Low Volume 1 query, which meant
where the Source data was missing due to available space
In Figure 2 we see that performance is roughly constant at about
was done after Low Volume 1’s Run 1 and we discount its 9 second
as anomalous.
Figure 2: Low-volume time series.
3.3.3 Low-volume 3 – spatially-restricted filter
SELECT COUNT(*)
FROM Object
WHERE ra_PS BETWEEN 1 AND 2
AND decl_PS BETWEEN 3 AND 4
AND fluxToAbMag(zFlux_PS) BETWEEN 21 AND 21.5
AND fluxToAbMag(gFlux_PS) - fluxToAbMag(rFlux_PS) BETWEEN 0.3 AND 0.4
AND fluxToAbMag(iFlux_PS) - fluxToAbMag(zFlux_PS) BETWEEN 0.1 AND 0.12;
In Figure 3 we see the same 4 second performance that was seen for
queries. Again, the ~9 second performance in Run 2 could not
it as resulting from competing processes on the cluster.
Figure 3: Low-volume spatially-restricted filter.
3.3.4 High volume 1 – count
SELECT COUNT(*) FROM Object
Figure 4: High volume count.
3.3.5 High-volume 2 – full-sky filter
SELECT objectId, ra_PS, decl_PS, uFlux_PS, gFlux_P
rFlux_PS, iFlux_PS, zFlux_PS, yFlux_PS
FROM Object
WHERE fluxToAbMag(iFlux_PS) - fluxToAbMag(zFlux_PS) > 4
Using the on-disk data footprint (MySQL’s MyISAM .MYD,
the Object table (1.824 x 10^12 bytes), we can compute the aggregate effective
bandwidth. Run 3’s 7 minute execution yields 4.0 GB/s in
while the other runs yield approximately 11 GB/s in aggregate,
each node was configured to execute up to 4 queries in parallel,
realistic, given seek activity from competing queries
theoretical transfer rate of 98 MB/s.
Figure 5: High volume full-sky filter.
3.3.6 High-volume 3 – density
SELECT COUNT(*) AS n, AVG(ra_PS), AVG(decl_PS), chunkId
FROM Object
GROUP BY chunkId
Figure 6: High volume full-sky filter.
This query computes statistics for table fragments (which
giving a rough estimate of object density over the sky. It
tion query support in Qserv. This query is of similar complexity
Figure 6 illustrates measured times significantly faster, which is
mission time. As mentioned for HV2, cache behavior was not
in Run 3 may be close.
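A distributed aggregation such as the density query above cannot be merged by averaging per-chunk averages; each worker has to return partial results (a count and a sum) that the front-end then combines. The following is a minimal sketch of that two-phase pattern, not Qserv's actual rewriter: it uses Python's sqlite3 module as a stand-in for the per-worker and front-end MySQL instances, and the function name and the `partial` table are hypothetical.

```python
import sqlite3


def merge_chunk_averages(chunks):
    """Two-phase AVG: per-chunk COUNT/SUM on 'workers', then a merge step.

    chunks maps chunkId -> list of ra_PS values (stand-in for chunk tables).
    Returns (per_chunk_rows, overall_average).
    """
    partials = []
    for chunk_id, ra_values in chunks.items():
        worker = sqlite3.connect(":memory:")  # one stand-in worker per chunk
        worker.execute("CREATE TABLE Object (ra_PS REAL, chunkId INTEGER)")
        worker.executemany("INSERT INTO Object VALUES (?, ?)",
                           [(ra, chunk_id) for ra in ra_values])
        # Worker-side query: return COUNT and SUM instead of AVG.
        row = worker.execute(
            "SELECT chunkId, COUNT(*), SUM(ra_PS) FROM Object GROUP BY chunkId"
        ).fetchone()
        worker.close()
        if row is not None:  # empty chunks contribute nothing
            partials.append(row)

    # Front-end merge: combine the partial counts and sums.
    merged = sqlite3.connect(":memory:")
    merged.execute("CREATE TABLE partial (chunkId INTEGER, n INTEGER, ra_sum REAL)")
    merged.executemany("INSERT INTO partial VALUES (?, ?, ?)", partials)
    per_chunk = merged.execute(
        "SELECT chunkId, SUM(n), SUM(ra_sum) / SUM(n) FROM partial "
        "GROUP BY chunkId ORDER BY chunkId"
    ).fetchall()
    overall = merged.execute(
        "SELECT SUM(ra_sum) / SUM(n) FROM partial"
    ).fetchone()[0]
    merged.close()
    return per_chunk, overall


per_chunk, overall = merge_chunk_averages({1: [1.0, 3.0], 2: [10.0]})
print(per_chunk, overall)
```

Note that the grouping column (chunkId here) must be carried through the partial results so that the merge step can group on it; without it the final aggregation has nothing to group by.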
3.3.7 Super-high-volume 1 – near neighbor
SELECT COUNT(*)
FROM Object o1, Object o2
WHERE qserv_areaspec_box(-5,-5,5,-5)
AND qserv_angSep(o1.ra_PS, o1.decl_PS,
o2.ra_PS, o2.decl_PS) < 0.1
This query finds pairs of objects within a specified spherical
ticular part of the sky. Over two randomly selected 100 deg^2 areas, the execution times
about 10 minutes (667.19 seconds and 660.25 seconds). The
between 3 to 5 billion. Since execution uses on-the-fly generated
fit in memory, and Qserv does not yet implement caching, we
negligible.
3.3.8 Super-high-volume 2 – sources not near objects
SELECT o.objectId, s.sourceId, s.ra, s.decl,
o.ra_PS, o.decl_PS
FROM Object o, Source s
WHERE qserv_areaspec_box(224.1, -7.5, 237.1, 5.5)
AND o.objectId = s.objectId
AND qserv_angSep(s.ra, s.decl, o.ra_PS, o.decl_P
This is an expensive query – an O(kn) join over 150 square degrees between a
30 TB table. Each objectId is unique in Object, but is shared
so k ∼ 41. We recorded times of a few hours (5:20:38.00, 2:06:56.33,
variance is presumed to be caused by varying spatial object
areas selected.
3.4 Scaling
We tested Qserv’s scalability by measuring its performance
in the cluster. To simulate different cluster sizes, the frontend
queries for partitions belonging to the desired set of cluster
data size proportionally without changing the data size
performance at 40, 100, and 150 nodes to demonstrate weak
3.4.1 Scaling with small queries
From Figures 7, 8, and 9, we see that execution time is unaffected by
the data per node is constant. The spike in the 40-node configuration (Figure 9) is caused by 2 slow
queries (23s and 57s); the other 28 executed in times ranging
Figure 7: Scaling with node count (1).
Figure 8: Scaling with node count (2).
Figure 9: Scaling with node count (3).
3.4.2 Scaling with expensive queries
High Volume
If Qserv scaled perfectly linearly, the execution time
node is constant. In Figure 10 the times for high volume queries show a slight
is primarily a test of dispatch and result collection overhead
with the number of chunks since the front-end has a fixed amount
Since we varied the set of chunks in order to vary the cluster
should thus vary linearly with cluster size. HV3 seems to
cache effects – its result was cached so execution became
The High Volume 2 query approximately exhibits the flat behavior
scalability. Caching effects may have clouded the results,
query results were perfectly cached, we expect the overall
by overhead as in HV1, and this is clearly not the case.
Figure 10: Scaling with high volume queries.
Super High Volume
The tests on expensive queries did not show perfect scalability,
surements did show some amount of parallelism. It is unclear
configuration was the slowest for both SHV1 and SHV2. Our time-limited
did not allow us to repeat executions of these expensive queries
in better detail.
Figure 11: Scaling with super high volume queries.
3.5 Concurrency
Figure 12: Concurrency test.
We were able to test Qserv with multiple queries in flight.
parallel invocations of HV2, one of LV1, and one of LV2. Each
1 second between queries. Figure 12 illustrates concurr
HV2 queries take about twice the time (5:53.75 and 5:53.71)
This makes sense since each is a full table scan that is competing
scanning has not been implemented. The first queries in the
in about 30 seconds, but each of their second queries seems
queries in the streams finish faster. Since the worker nodes
for queries and do not implement any concept of query cost,
system. The slowness of low volume queries after the second
glance, since they should be queued at the end on their assigned
complete near the end of the HV2 queries. In that case, subsequent
workers with nearly empty queues and execute immediately.
by query skew – short queries may land on workers that have
on the high volume queries.
3.6 Discussion
3.6.1 Latency
LSST’s data access needs include supporting both small,
longer, hour/day-scale queries. We designed Qserv to operate
avoid needing multiple systems, which would be costly in
hardware. Indexing was implemented in order to reduce latency
touch a small part of the data.
The current Qserv implementation incurs significant overhead
lecting results. In early development we decided to minimize
so the front-end master became responsible for preparing
did not need to perform parsing or variable substitution.
heavyweight as well. MySQL does not provide a method to
instances, so tables are dumped to SQL statements using mysqldump and reloaded on the
front-end. This method was chosen to speed prototyping,
work, and database transactions are strong motivations
3.6.2 Solid-state storage
Some of Qserv’s design choices (e.g., shared scanning)
around poor seek performance characteristics of disks.
a practical alternative to mechanical disk in many applications.
indexes, its current cost differential per unit capacity
bulk data. In the case of flash storage, the most popular solid-state
scanning is still effective in optimizing performance since
storage and flash still has “seek” penalty characteristics
disk).
3.6.3 Many core
We expect the performance to be I/O constrained, since the
formance limited. It is unlikely that many cores can be leveraged
will be sized with only the number of disk spindles that saturate
scanning should increase CPU utilization efficiency.
3.6.4 Alternate partitioning
The rectangular fragmentation in right ascension and declination,
alize physically for humans, is problematic due to severe
exploring the use of a hierarchical scheme, such as the hierarchical [6] for
partitioning and spatial indexing. These schemes can produce
area, and map spherical points to integer identifiers encoding
subdivision levels. Interactive queries with very small
operate over a small set of fine partition IDs. If chunks are
may allow I/O to occur at below sub-chunk granularity without
other bonus is that mature, well tested, and high-performance
computing the partition IDs of points and mapping spherical
3.6.5 Distributed management
The Qserv system is implemented as a single master with many
reasonable and has performed adequately in testing, but
instance at LSST’s planned scale may have a million fragment
have plans to optimize the query management code path, managing
point is likely to be problematic. The test data set described
about 9,000 chunks, which means that a launch of even the most
about 9,000 chunk queries.
One way to distribute the management load is to launch multiple
is simple and requires no code changes other than some logic
balance between different Qserv masters. Another way is
management. Instead of managing individual chunk queries,
groups of them to lower-level masters which could either
groups or manage the individual chunk queries themselves.
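The second option described above, delegating groups of chunk queries to lower-level masters, can be illustrated with a simple grouping step. This is a minimal sketch rather than a design from the report: the function name and the choice of 30 lower-level masters are hypothetical, and round-robin assignment is used only for simplicity; a real scheme would presumably group chunks by the worker nodes that host them.

```python
def group_chunk_queries(chunk_ids, num_sub_masters):
    """Split chunk IDs into near-equal groups, one per lower-level master.

    Hypothetical helper: round-robin assignment, ignoring data locality.
    """
    if num_sub_masters < 1:
        raise ValueError("need at least one lower-level master")
    groups = [[] for _ in range(num_sub_masters)]
    for i, chunk_id in enumerate(chunk_ids):
        groups[i % num_sub_masters].append(chunk_id)
    return groups


# 8,983 chunks is the chunk count of the 150-node test data set;
# 30 lower-level masters is an arbitrary illustrative value.
groups = group_chunk_queries(range(8983), 30)
sizes = [len(g) for g in groups]
print(len(groups), min(sizes), max(sizes))
```

With this split the top-level master tracks 30 group dispatches instead of roughly 9,000 individual chunk queries, and each lower-level master tracks about 300.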
4 100-TB Scalability Test (JHU 20-node cluster)
In the fall of 2012, we were provided the opportunity to use
Hopkins University (JHU), that were large memory, multi-processor
large storage attached, to set up as a Qserv service. There
the nodes had two processors with 12 cores each of Intel Xeon
mounted as a data volume, were 22 TB raid arrays, to provide
This would provide a high volume storage test, but with a
with one master node, and 20 worker nodes.
The data used would be produced on each node, starting with
test dataset was 220 GB, but high density data in one particular
wide. We chose a high density spot of this data, and then
sphere, to provide a high density dataset over the whole
data.
The production of this large amount of data proved to have
rather slow, taking many weeks for a full production on each
were set up, and the stability of the cluster was an issue,
running, and many smaller production processes were set up
produced over 100 TB of csv text files, into about 7000 chunks
done, then this data was loaded into MySQL MyISAM tables,
also took days.
Over the course of this time, often nodes would go out with
amount of time, before coming back. Often this would be with
volumes, but the loss of computing nodes would set back the
complete. But this was a smaller problem than the problem of
were up and running, the data service was still having problems.
loss or corruption of a few of the thousands of tables on the
either data would be re-created or just blocked for testing.
with data corruption was a constant issue, but still a large
least. Also a problem was file corruption on the install software,
bereinstalledoverthecourseofthetesting.Afull100TB
about85TBcouldeverbeserved,beforetheresourceshad
Butsometestingwasabletogetdoneonthiscluster.Afull
Objectdata,althoughthiswasonlyontheaprox,85TBof
100TBofdatathatwasgenerated.Thequerywasperformed:
SELECTcount(*F)ROMObject
+------------+
|count(*)|
+------------+
|2059335968|
+------------+
1rowinset(19.26sec)
Showing 2B objects in the table. The result here was after
times, and the cache had been stabilized. This is similar
testing.
The low volume test from access to a small portion of the sky
query:
SELECT count(*)
FROM Object
WHERE qserv_areaspec_box(1,3,2,4)
AND scisql_fluxToAbMag(zFlux_PS) BETWEEN 21 AND 21.5
+----------+
| count(*) |
+----------+
|      748 |
+----------+
1 row in set (4.45 sec)
This time for access is also similar to previous testing,
small part of the sky of a certain color. The ~4.5 sec. overhead
for data access in this version of the Qserv software.
A high volume data test was performed, looking for color
certain range of color. This will scan over all objects, to
test query was:
SELECT objectId, ra_PS, decl_PS, uFlux_PS, gFlux_PS,
rFlux_PS, iFlux_PS, zFlux_PS, yFlux_PS
FROM Object
WHERE scisql_fluxToAbMag(iFlux_PS) -
scisql_fluxToAbMag(zFlux_PS) > 4
This query returned 15695 records in 6 min 33.50 sec. Again
ber of times, and this time is the average time after the caches
performed again, this time looking at a lower number of records,
between i and z flux of 5 this time. This query returned 2967 records,
time was a little lower this time, which was mostly the time
where the rest of the time was the overhead in scanning the
these records. The previous tests were done on 30 TB of data,
these nodes had many fewer cores. But this test would return
here it is about 375 sec. The extra time here will come from
data per node, and amount of data in general, and the access
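The color cut and the timing of the high volume query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the magnitude conversion assumes the standard AB convention (flux in erg/s/cm^2/Hz, zero point 48.6) rather than being taken from the scisql source, and the scan rate assumes the query touched every one of the 2,059,335,968 rows reported by the count query, spread evenly over the 20 workers.

```python
import math

def flux_to_ab_mag(flux):
    """AB magnitude for a flux in erg/s/cm^2/Hz (assumed convention)."""
    return -2.5 * math.log10(flux) - 48.6

def passes_color_cut(i_flux, z_flux, min_color=4.0):
    """True when the i - z color exceeds min_color magnitudes."""
    return flux_to_ab_mag(i_flux) - flux_to_ab_mag(z_flux) > min_color

def flux_ratio_for_color(color):
    """z/i flux ratio at which the i - z color equals `color` magnitudes."""
    return 10 ** (color / 2.5)

# Figures quoted in the text; the even split over workers is an assumption.
TOTAL_ROWS = 2_059_335_968
SCAN_SECONDS = 6 * 60 + 33.50   # 6 min 33.50 sec
WORKERS = 20

cluster_rate = TOTAL_ROWS / SCAN_SECONDS   # rows per second, whole cluster
per_worker_rate = cluster_rate / WORKERS   # rows per second, per worker

if __name__ == "__main__":
    print(f"i - z > 4 means z flux > {flux_ratio_for_color(4):.1f} x i flux")
    print(f"cluster: {cluster_rate:,.0f} rows/s, per worker: {per_worker_rate:,.0f} rows/s")
```

Under these assumptions a cut of i - z > 4 selects objects whose z-band flux is roughly 40 times the i-band flux, which is consistent with so few rows being returned from a full scan.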
5 Concurrency Tests (SLAC 100,000 chunk-queries)
A previous version of the Qserv master code had dedicated
each query, one for dispatching chunk queries, and the other
dispatch pool was sized at a quite high 500 threads for two
dispatch work as quickly as possible, allowing the Qserv
best. Secondly, the first query dispatch against a chunk
latency on a full table scan of ~10,000 chunks takes approximately
patch threads. Subsequently, XRootD caching allows for
comparison.
The thread pool for result reads was given a much smaller size:
is because the Qserv master process can only exploit limited
merging worker results for a single query. In fact, the main
as 20 threads is that it reduces result merge latency when
hence result availability times) are skewed.
An unfortunate consequence of this simple design was that
queries would cause thread creation failures in the Qserv
changed to a unified query dispatch and result reader thread
To test our ability to handle many concurrent full-table
threads, we partitioned the PT1.2 [Document-9044] Object table into ~8,000 chunks,
tributed them across an 80 node cluster at SLAC. The nodes
had limited quantities of RAM, making them the perfect workhorses
particular, asking for more than ~1000 threads would cause
form. Using the new unified thread-pool design we were able
and 12 concurrent Object table scans each involving ~8,000
execution time of 2 to 8 minutes, thus demonstrating that
of ~100,000 in-flight chunk queries, even on very old hardware.
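The thread and chunk-query figures quoted above can be checked with a small piece of arithmetic. The sketch below is a reading of the numbers, not a description of the Qserv code: it assumes the earlier design created one 500-thread dispatch pool and one 20-thread result pool per query, and it treats the ~1000 thread figure as a hard limit. All names are hypothetical.

```python
# Figures quoted in the text; the per-query interpretation is an assumption.
DISPATCH_THREADS_PER_QUERY = 500
RESULT_THREADS_PER_QUERY = 20
THREAD_LIMIT = 1000          # approximate limit on the SLAC nodes

def max_concurrent_queries(limit, threads_per_query):
    """Number of queries that fit under the thread limit with per-query pools."""
    return limit // threads_per_query

threads_per_query = DISPATCH_THREADS_PER_QUERY + RESULT_THREADS_PER_QUERY
old_design_capacity = max_concurrent_queries(THREAD_LIMIT, threads_per_query)

# In-flight chunk queries for the concurrent table scans in the test.
CONCURRENT_SCANS = 12
CHUNKS_PER_SCAN = 8_000      # approximate
in_flight_chunk_queries = CONCURRENT_SCANS * CHUNKS_PER_SCAN

if __name__ == "__main__":
    print(f"per-query pools: {threads_per_query} threads, "
          f"so about {old_design_capacity} concurrent query under the limit")
    print(f"in-flight chunk queries: {in_flight_chunk_queries:,}")
```

Under these assumptions a second concurrent query would already exceed the limit with per-query pools, while 12 scans of ~8,000 chunks give 96,000 chunk queries, in line with the ~100,000 quoted.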
Note that using a unified thread pool for result reads requires
starvation. A single query can easily require 10,000 result
total threads in the pool. As a result, we must be careful
single query, or queries that should be interactive can easily
as they wait for a table scan to finish. Our unified thread
assigns available threads to the query using the fewest
create new threads when a new query is encountered (up to
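The policy of giving a free thread to the query that currently uses the fewest threads can be illustrated with a small scheduler model. This is a minimal sketch, not the Qserv implementation: the class and method names are hypothetical, it models only the assignment decision (no real threads), and it leaves out the creation of new threads for newly seen queries.

```python
from collections import defaultdict, deque

class FairScheduler:
    """Hands out work to whichever query has the fewest tasks running."""

    def __init__(self):
        self.pending = defaultdict(deque)   # query id -> queued tasks
        self.running = defaultdict(int)     # query id -> tasks in progress

    def submit(self, query_id, task):
        """Queue one unit of work (for example, one result read) for a query."""
        self.pending[query_id].append(task)

    def next_task(self):
        """Pick work for a free thread, or return None if nothing is queued."""
        candidates = [q for q, tasks in self.pending.items() if tasks]
        if not candidates:
            return None
        # Fewest running tasks wins; ties go to the query submitted first.
        query_id = min(candidates, key=lambda q: self.running[q])
        self.running[query_id] += 1
        return query_id, self.pending[query_id].popleft()

    def done(self, query_id):
        """Record that one task of the given query has finished."""
        self.running[query_id] -= 1

if __name__ == "__main__":
    sched = FairScheduler()
    for chunk in range(10):
        sched.submit("scan", chunk)
    for _ in range(4):                      # a pool of 4 threads, all busy on the scan
        sched.next_task()
    sched.submit("interactive", "chunk-1")  # a short query arrives
    sched.done("scan")                      # one thread frees up
    print(sched.next_task())                # the interactive query is served next
```

In this model a single-chunk query never waits behind the remaining tasks of a table scan once any thread becomes free, which is the starvation the paragraph above is concerned with.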
To test this, we set up Qserv with a single master node and
was configured with ~12,000 empty chunk tables. We then submitted
(SELECT COUNT(*) FROM Object), and an interactive query (SELECT COUNT(*) FROM Object
objectId IN (1)) requiring just a single chunk query to answer.
demonstrate that the master immediately allocated threads
chunk query and read back its results, the execution time
higher than it should have been. It turns out this is because
chunk query scheduling policy, and the single chunk query
user query was being queued up behind a multitude of chunk
on the worker side. We are currently working to address this
work on shared scans.
[1] [DMTR-12], Becla, J., 2013, Qserv 300 node test, DMTR-12, URL https://ls.st/DMTR-
[2] [LDM-135], Becla, J., Wang, D., Monkewitz, S., et al., 2013, Database Design, LDM-135, URL https://ls.st/LDM-135
[3] [LDM-144], Freemon, M., Pietrowicz, S., Alt, J., 2016, Site Specific Infrastructure
Model, LDM-144, URL https://ls.st/LDM-144
[4] [Document-26217], Kantor, J., 2010, Data Challenge 3b Performance Test, Document-26217, URL https://ls.st/Document-26217
[5] [Document-9044], Kantor, J., Axelrod, T., Allsman, R., Freemon, Data
Challenge 3b Overview, Document-9044, URL https://ls.st/Document-9044
[6] Kunszt, P. Z., Szalay, A. S., Thakar, A. R., 2001, In: Banday,
(eds.) Mining the Sky, 631, doi:10.1007/10849171_83, ADS Link