LargeSynopticSurveyTelescop
    Early(pre-2013)Large-Scale
    JacekBecla,K-TLim,DanielWang
    DMTR-21
    LatestRevision:2013-08-02
    Abstract
    ThistestreportcontainsQservlarge-scaleteststhat
    searchanddevelopmentphase.Theformatisnon-standard
    testsofdifferentaspectsofQservperformedoveranumber
    wereperformedinearly2013orearlier.Manyofthese
    mentedontheLSSTTrac,andversionsofLDM-135from2013andearlier.
    LARGESYNOPTICSURVEYTELESCOPE

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    Contents
    1Introduction1
    1.1Idealenvironment.........................
    1.2Scheduleoftesting.........................
    1.3Currentstatusoftests........................
    2PhaseIIQservTesting(upto100nodes)3
    2.1ir2farmtesting...........................
    2.1.1Hardware..........................
    2.1.2Initialtesting.........................
    2.1.3Further38+1nodetestingwithr18694..................
    2.1.438+1usingr19068........................
    2.2Scalabilitytesting..........................
    2.2.1Usingir2farm38+1configuration...................
    2.3boertesting...........................
    2.3.1Introduction.........................
    2.3.2Hardware..........................
    2.3.3Software...........................
    2.3.4Configuration.........................
    2.3.5InitialResults.........................
    2.4Notes.............................
    3150-nodeScalabilityTest18
    3.1Hardware............................
    3.2Data..............................
    3.3Queries.............................
    3.3.1Low-volume1–objectretrieval....................
    3.3.2Low-volume2–timeseries.....................
    3.3.3Low-volume3–spatially-restrictedfilter.................
    3.3.4Highvolume1–count......................
    ii

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    3.3.5High-volume2–full-skyfilter....................
    3.3.6High-volume3–density......................
    3.3.7Super-high-volume1–nearneighbor..................
    3.3.8Super-high-volume2–sourcesnotnearobjects..............
    3.4Scaling.............................
    3.4.1Scalingwithsmallqueries.....................
    3.4.2Scalingwithexpensivequeries....................
    3.5Concurrency...........................
    3.6Discussion............................
    3.6.1Latency...........................
    3.6.2Solid-statestorage........................
    3.6.3Manycore..........................
    3.6.4Alternatepartitioning.......................
    3.6.5Distributedmanagement......................
    4100-TBScalabilityTest(JHU20-nodecluster)31
    5ConcurrencyTests(SLAC100,000chunk-queries)33
    iii

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    Early(pre-2013)Large-ScaleQserv
    1Introduction
    1.1Idealenvironment
    Basedonthedetailedspreadsheetanalysis,weexpectthe
    willbecomposedoffewhundreddatabaseservers[LDM-144],soarealistictestshould
    cludeaclusterofatleast100nodes.
    Totaldatabasesizeofasingledatareleasewillvaryfrom
    1
    Realistictestingrequiresatleast~20-30TBofstorage
    Notethatlotofhighlyfocusedtestswhichareextremelyuseful
    ofthesystemcanbedoneonaverysmall,2-3cluster,or
    exampleofthatcanbemeasuringtheeffectoftablesizeon
    join:thistypeofjoinwillbedonepersub-partition,and
    rows),thusalmostalltestsinvolvingasinglesub-partition
    withverylittlediskstorage.
    Asignificantamountoftestingshouldbedonewherethedataset
    memorysizebyanorderofmagnitude.Thistestingisimportant
    manceinthepresenceofdiskperformancecharacteristics.
    Itisessentialtohaveatleasttwodifferenttypesofcatalogs:
    dataneedstobecorrelated,thatis,theobjectsshould
    these2tableswillallowustomeasurespeedofjoins.It
    ofsource-liketables(
    DIASource
    ,
    ForcedSource
    ) – the tests
    Source
    done
    should
    with be a good
    approximation.
    The most important characteristic of the test data is its
    reflect realistic densities: presence of very crowded or
    how data is partitioned and on performance of certain queries
    inside one partition). Other than realistic spatial distribution,
    1
    These numbers are for single copy, data and indices, compressed when
    1

    LARGE SYNOPTIC SURVEY TELESCOPE
    Early Qserv Tests
    DMTR-21
    Latest Revision 2013-08-02
    valid (e.g., magnitudes) in order to try some queries with
    These tests are not only used to prove our system is capable
    but also as a mean to stress the system and uncover potential
    In practice, whoever runs these tests should well understand
    architecture system and turning MySQL.
    1.2 Schedule of testing
    •Selecting the base technology – Q2 2009
    •Determining the architecture – Q3 2009
    •Pre-alpha tests focused on parallel distributed query
    scale (~20 nodes) – Q4 2009
    •Most major features in (except shared scan, user tables),
    cluster (~100 nodes) – Q4 2010
    •Scalability and performance improvements and tests on
    – Q4 2011
    •Large scale tests, performance tests on a large cluster
    •Shared scans – Q3 2013
    •Fault tolerance – Q4 2013
    1.3 Current status of tests
    We have run several large scale tests
    1.(10/2009) A test with the “pre-alpha” version of our software
    using
    Tuson
    thecluster at LLNL (160 nodes, each node: two Intel
    with 4 GB RAM and 1 local hard disk of 80 Gbs)
    2.(2010) several 100-node tests run at SLAC. These tests
    necks and prompted rewriting parts of our software, as
    missing features in XRootD.
    2

    LARGE SYNOPTIC SURVEY TELESCOPE
    Early Qserv Tests
    DMTR-21
    Latest Revision 2013-08-02
    3.(4/2011) A 30 TB test on 150-node SLAC cluster using Qserv
    urations, using 2 billion row Object and 32 billion row
    set.
    4.(12/2012) A 100-terabyte
    Datascope
    20-node
    test on JHU’s
    cluster. Due to high
    of the cluster this test turned into testing resilience
    administrative tools.
    5.(05/2013) A 10,000-chunk test on a 120-node SLAC cluster
    subtle issues with concurrency at scale.
    6.(07/2013) A test on 300-node IN2P3 cluster to re-test
    rency, and uncover unexpected issues and bottlenecks.
    7.(08/2013) A demonstration of shared scans.
    The tests 2-5 listed above are described in further details
    DMTR-12
    .
    2 Phase II Qserv Testing (up to 100 nodes)
    This section describes the earliest testing of Qserv, during
    testing
    used
    USNO-B
    thedata
    , which
    set was not representative of the shape
    LSST data, although it represented real measurements on
    2.1 ir2farm testing
    2.1.1 Hardware
    We have been granted use (since September 2010) of 40 nodes
    ir2farm. Its main
    purpose
    BaBar
    data
    is to
    processing,
    support
    and our access will
    rary.
    2.1.1.1 Node configuration
    3

    LARGE SYNOPTIC SURVEY TELESCOPE
    Early Qserv Tests
    DMTR-21
    Latest Revision 2013-08-02
    •SUN X4100 server
    •CPU: Dual Opteron 275 processors (4 cores total), 2.2GHz,
    •Memory: 4GB
    •HD: 1 per machine for testing. binaries + data on same disk.
    Disk details: 73.5GB FUJITSU MAY2073RCSUN72G 10k SAS
    Fujitsu MAY2073RC series, 2.5”, 73.5GB, 8MB buffer,
    Seek Time: 4 ms (average) / 8 ms (max)
    Track-to-Track Seek Time 0.2 ms
    Average Latency 2.99 ms
    Spindle Speed 10000 rpm
    Mean time before failure 1,400,000 hour(s)
    Storage Hard Drive / Non-Recoverable Errors 1 per 10
    15
    –SeekErrors10per10
    8
    –1Gbitethernet,iperf:944MBits/secusingdefault
    2.1.1.2OS/software/clusterconfiguration
    •RHEL5.5Linux
    •2.6.18serieskernel
    •gcc4.1.2
    •Python2.5.2(fromLSSTstack)
    •MySQLserver5.1.45(defaulttuningsfromsourcedistrib)
    •MySQLclient5.0.45,python1.2.2(LSSTstack)
    •antlr2.7.7
    •Twisted9.0.0
    •boost1.42.0
    4

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    •swig1.3.36+2(LSSTstack_)
    •tcltk8.5a4
    •scons2.0.1
    •Developmentversionofxrootd/scalla(inc.changesnot
    •Using39of40nodes,onehadafaultydiskattimeofinstallation,
    redistributeddatasinceitgotfixed.
    •1Master,38worker.
    •Master:xrootd/cmsdinmanagermode,qserv-master,MySQL
    datadb,mysql-proxy
    •Worker:xrootd/cmsdinservermode,qserv-workeras
    dbs.
    2.1.1.3DataconfigurationUsingduplicatedPT1data.
    •PT1Objectdata,33copiesinRA,17copiesindeclination
    #!/bin/sh
    fpath=orig
    opath=.
    tname=Object
    #totalisseq0140
    foriin`seq09`;do
    beg=`expr$i\*4`
    end=`expr$beg+4`
    echo"beg$begend$end"
    pythonduplicatePT1.py--schema$fpath/$tname.sql
    --delim=,\
    --raCopies=33--declCopies=17\
    --start=$beg--end=$end\
    --box=-5.25,-4.4,5.25,4.75\
    5

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    --outPrefix=$opath/$tname\
    $fpath/$tname.txt
    done
    •Partitionedusingfixedpartitions.60stripes,18substripes.
    ./lsstpy.shpartition.py-v-t2-p4-S60-s18-PObject
    •2001chunkstotal.
    2.1.2Initialtesting
    PerformedusingtrunkSVNr17580
    •full-skyrowcount:
    SELECTCOUNT(*)FROMObject;
    –1m50s
    •Columnquery,full-sky:
    SELECTCOUNT(*)FROMLSST.ObjectWHEREgFlux_PS>3000;
    –25min3.32sec
    •1sqdegreecountquery:
    SELECTCOUNT(*)FROMLSST.ObjectWHEREqserv_areaspec_box(6,6,7,7)
    ra_PSBETWEEN6AND7ANDdecl_PSBETWEEN6AND7;
    –13.42sec
    •Fulltablescanaggregation:
    SELECTAVG(ra_PS)FROMObject;
    –25m2s
    6

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    –Effectivebandwidth:
    *1+38nodes
    *1.47Brows*1015bytes/row=~1.5Billionbytes
    *Bandwidth:995MB/sec,or26MB/secforeachdisk(raw
    •Spatially-restrictedselection(smaller):
    selectcount(*)fromLSST.Objectwhereqserv_areaspec_box(5,5,6,6);
    –(count(*)=0)
    –25min5.09sec
    –Usingqservr18134withspatialUDF
    –withbox:6.5,6.5,7.1,7.1:11.13sec,count=17151,
    •Spatially-restrictedselection(bigger):
    selectcount(*)fromLSST.Objectwhereqserv_areaspec_box(1,1,6,6);
    –(count(*)=590978
    –25min6.52sec
    –Usingr18134withspatialUDFprocessing
    –1994chunkstouched.
    •Spatially-restrictedselection(small:0.25sqdeg):
    selectcount(*)fromLSST.Objectwhereqserv_areaspec_box(5.5,5.5,6,6);
    –(count(*)=0
    –12.52sec
    –Usingr18134withspatialUDFprocessing
    –2chunkstouched.
    •Near-neighborover9sqdeg
    selectcount(*)fromLSST.Objecto1,LSST.Objecto2
    andqserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,
    –(count(*)=395606096)
    –6m18.32s;6m7.63s
    –4chunkstouched.
    7

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    2.1.3Further38+1nodetestingwithr18694
    •Simplecountquery
    mysql>selectcount(*)fromObject;
    Times:9:56.84,45.42,44.61~10minuteexecutionfor
    theframeworkincursoverheadindeterminingchunk-node
    themforfutureaccess.
    •Near-neighborover100sqdegrees
    mysql>selectcount(*)fromLSST.Objecto1,LSST.Object
    whereqserv_areaspec_box(-11,-11,-1,-1)
    andqserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,
    +------------+
    |count(*)|
    +------------+
    |8261246957|
    +------------+
    1rowinset(41min54.28sec)
    Note:thereare4,373,646rowsinthisbox,spreadover
    •Concurrentnearneighborqueries.2x100sqdeg
    mysql>selectcount(*)fromLSST.Objecto1,LSST.Object
    whereqserv_areaspec_box(-11.1,-11.1,-1.1,-1.1)
    andqserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,o2.decl_PS)
    +------------+
    |count(*)|
    +------------+
    |8136886173|
    +------------+
    1rowinset(51min8.50sec)
    mysql>selectcount(*)fromLSST.Objecto1,LSST.Object
    whereqserv_areaspec_box(2,-11,12,-1)
    8

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    andqserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,o2.decl_PS)
    +------------+
    |count(*)|
    +------------+
    |7763722919|
    +------------+
    1rowinset(36min41.26sec)
    •COUNT(*)isrunningabout30swhenthingsarecached.
    2000activechunks,thisisabout66chunkqueriesper
    lotoflogisticalsteps:querygeneration,nodelookup
    (network),open-readresults-close(network),write
    dumpfile,writedumpfile,readdumpfileintomysql,merge
    table.Thisisadditiontotheactualsubqueryexecution.
    faster.Mightbetimeforsomemicrobenchmarks.
    2.1.438+1usingr19068
    •COUNT(*)isnowrunningwith~12slatency.Shouldbe
    bufferingresultsonfrontend(necessarytohandlelarge
    2.2Scalabilitytesting
    2.2.1Usingir2farm38+1configuration
    2.2.1.1Testharnessat5nodesTestedwithonly5nodes,samedatadensity
    mately)as38nodecfg.Testoutsamplequeries.Actualbenchmark
    numbersinthewhereclausetopreventqueryresultcaching.
    •Stream1:
    mysql>selectavg(uFlux_PS),avg(uFlux_PS_Sigma)from
    +------------------+---------------------+
    |avg(uFlux_PS)|avg(uFlux_PS_Sigma)|
    9

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    +------------------+---------------------+
    |3538.14419765824|120.888001608003|
    +------------------+---------------------+
    1rowinset(29min9.21sec)
    mysql>selectavg(uFlux_PS),avg(uFlux_PS_Sigma)from
    +------------------+---------------------+
    |avg(uFlux_PS)|avg(uFlux_PS_Sigma)|
    +------------------+---------------------+
    |2431.19667946996|98.6096359211597|
    +------------------+---------------------+
    1rowinset(25min41.61sec)
    •Stream2:
    mysql>selectavg(uFlux_PS),avg(uFlux_PS_Sigma)from
    +------------------+---------------------+
    |avg(uFlux_PS)|avg(uFlux_PS_Sigma)|
    +------------------+---------------------+
    |3538.14419765824|120.888001608003|
    +------------------+---------------------+
    1rowinset(18min32.53sec)
    mysql>selectavg(uFlux_PS),avg(uFlux_PS_Sigma)from
    +------------------+---------------------+
    |avg(uFlux_PS)|avg(uFlux_PS_Sigma)|
    +------------------+---------------------+
    |4003.46130769641|129.511811500079|
    +------------------+---------------------+
    1rowinset(29min22.81sec)
    •Singlequery:
    mysql>selectavg(uFlux_PS),avg(uFlux_PS_Sigma)from
    +------------------+---------------------+
    |avg(uFlux_PS)|avg(uFlux_PS_Sigma)|
    +------------------+---------------------+
    10

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    |3933.44004534094|128.243156312967|
    +------------------+---------------------+
    1rowinset(14min34.65sec)
    Thesingle(non-concurrent)queryisfaster,asexpected.
    unlessscansareshared,twosimultaneousfull-scans
    aslong.
    2.3boertesting
    2.3.1Introduction
    During12/2010,wehaveobtainedtemporaryaccesstothe
    accesstoboer0021-boer0122(whereboer0042isadeadnode),
    2.3.2Hardware
    Theboerclusterconsistsof
    •SUNX2200M2servers
    •CPU:DualOpteron2218processors(4corestotal),2.613GHz,
    •Memory:8GB
    •Systemdisk:binaries
    –SATA3.0GbpsconfiguredasUDMA/133
    –SeagateST32500N
    –SEAGATEST32500NSSUN250G0732B4ES5H,3.AZK,maxUDMA/133
    –250GB
    •Datadisk:mysqltables
    –HitachiHDS7225SRev:V5DO
    11

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    –HDS7225SCSUN250G
    –250GB
    Otherconfiguration/settingsseemtobethesameasforir2farm.
    2.3.3Software
    Wehaveconfiguredthesoftwareidenticallyasforir2farm,
    datedqservsoftware.
    2.3.4Configuration
    Wehavesetuptheclusterwithone“headnode”and100“worker”
    runs:mysqlproxy,mysqld,qservfrontend,xrootd/cmsd(manager),
    Eachworkernoderuns:mysqld,xrootd/cmsd(server+libqserv_worker).
    2.3.5InitialResults
    usingtrunkr18510
    •Full-skyobjectcount:
    selectcount(*)fromObject;
    Exampleresult:
    +------------+
    |count(*)|
    +------------+
    |1546902522|
    +------------+
    1rowinset(3min18.60sec)
    Times:3m18.60s,20.48s,20.47s,20.43s
    12

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    Notethefirstexecutionofthequeryisslow,sinceit
    manager.Afterthemanagerrestarts,itmustrebuild
    pings.Eachmappingisresolvedviaabroadcastmessage
    achunk.Sincewearerunninginmanager+supervisor
    takesoneadditionalnetworkhopforeachlookupresolution.
    tionalhopisincurredforconfigurationsbetween65-4096
    •1.2x1.2degreesquareselection
    mysql>selectcount(*)fromObjectWHEREqserv_areaspec_box(0,0,1.2,1.2);
    +----------+
    |count(*)|
    +----------+
    |27508|
    +----------+
    1rowinset(1.38sec)
    •1sqdegselection
    mysql>selectcount(*)fromObjectWHEREqserv_areaspec_box(-2,-2,-1,-1);
    +----------+
    |count(*)|
    +----------+
    |47336|
    +----------+
    1rowinset(2.84sec)
    •1/4sqdegselection
    mysql>selectcount(*)fromObjectWHEREqserv_areaspec_box(-2,-2,-1.5,-1.5);
    +----------+
    |count(*)|
    +----------+
    |16285|
    +----------+
    1rowinset(1.44sec)
    •0.04sqdegselection
    13

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    mysql>selectcount(*)fromObjectWHEREqserv_areaspec_box(-2,-2,-1.8,-1.8);
    +----------+
    |count(*)|
    +----------+
    |2754|
    +----------+
    1rowinset(1.42sec)
    •column-bounded0.01sqdegselection
    mysql>selectcount(*)fromObjectWHEREqserv_areaspec_box(-2,-2,-1.9,-1.9)
    +----------+
    |count(*)|
    +----------+
    |0|
    +----------+
    1rowinset(1.43sec)
    NotethatuFlux_PSisNULLinthispatchoftheskyfor
    rFlux_PSforthem.
    mysql>selectmax(uFlux_PS)fromObjectWHEREqserv_areaspec_box(-2,-2,-1.9,-1.9);
    +---------------+
    |max(uFlux_PS)|
    +---------------+
    |NULL|
    +---------------+
    1rowinset(1.42sec)
    •full-skyaverage:
    mysql>selectavg(uFlux_PS)fromObjectwhereuFlux_PS
    +------------------+
    |avg(uFlux_PS)|
    +------------------+
    |3526.44353629347|
    +------------------+
    1rowinset(19min11.33sec)
    14

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    2.3.5.1Usingr18512Switchedtosynchronousdispatchtoreduce
    mysql>selectavg(uFlux_PS)fromObjectwhereuFlux_PS
    +------------------+
    |avg(uFlux_PS)|
    +------------------+
    |3526.44353629347|
    +------------------+
    1rowinset(14min11.17sec)
    2.3.5.2Usingr18530
    Thefrontendcodewaschangedtoreducethreadcontention.
    aqueryisnowserial,andresultprocessingislimited
    file-handleandthreadcrowding.Throughputincreases,
    fectsaremoreseveresincethechunklookupsarenowserial
    patching).
    Intheresultsbelow,thecoldstarteffectsarestrongly
    However,oncethelookupsarecached,theresultingexecution
    mysql>selectavg(uFlux_PS)fromObjectwhereuFlux_PS
    +------------------+
    |avg(uFlux_PS)|
    +------------------+
    |3526.44353629347|
    +------------------+
    1rowinset(22min50.89sec)
    mysql>selectavg(uFlux_PS)fromObjectwhereuFlux_PS
    +------------------+
    |avg(uFlux_PS)|
    +------------------+
    15

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    |3526.44353629347|
    +------------------+
    1rowinset(10min35.80sec)
    mysql>selectavg(uFlux_PS)fromObjectwhereuFlux_PS
    +------------------+
    |avg(uFlux_PS)|
    +------------------+
    |3526.44353629347|
    +------------------+
    1rowinset(10min35.47sec)
    Wearetestingatoolthatwarmsupthelookupcachebypre-requesting
    advanceofrealqueries.Thebottleneckisalargechunk(28GB
    whichtakes150+secondsafterallotherchunkshavecompleted.
    sibleforincreasingexectimebeyond8minutes.Below7
    goodbalancing.MySQLtuningshouldimprovethingsfurther,
    showsanunderutilizationofdiskbandwidth.
    •Fullskyobjectcount
    SELECTCOUNT(*)FROMObject;
    18m10.62s;9.31s;19.53s;20.31s
    •Column-restrictedfull-skycount
    SELECTCOUNT(*)FROMLSST.ObjectWHEREgFlux_PS>3000;
    10min35.09sec
    •Full-skyaverageaggregation
    SELECTAVG(ra_PS)FROMObject;
    10m35.05s
    16

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    •Small1sqdegselection
    selectcount(*)fromLSST.Objectwhereqserv_areaspec_box(-6,-6,-5,-5);
    8.41sec
    •Larger25sqdegselection
    selectcount(*)fromLSST.Objectwhereqserv_areaspec_box(-5,-5,0,0);
    8.1sec
    selectcount(*)fromLSST.Objectwhereqserv_areaspec_box(1,1,6,6);
    7.61sec
    •Smallnear-neighborselection9sqdeg
    selectcount(*)fromLSST.Objecto1,LSST.Objecto2
    andqserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,o2.decl_PS)
    4m32.50s;4min16.51sec
    •Mediumnear-neighborselection100sqdeg(2concurrent
    selectcount(*)fromLSST.Objecto1,LSST.Objecto2
    whereqserv_areaspec_box(-11,-11,-1,-1)
    andqserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,o2.decl_PS)
    selectcount(*)fromLSST.Objecto1,LSST.Objecto2
    whereqserv_areaspec_box(2,-11,12,1)
    andqserv_angSep(o1.ra_PS,o1.decl_PS,o2.ra_PS,o2.decl_PS)
    Thesequeriesfailedtocompleteduetoamisconfiguration
    therewasnotenoughtimetofixandre-run.Theywereexpected
    proportionaltotheircoveredarea,andtonotimpede
    theyshouldbeparallelizedamongdifferentnodes.
    17

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    2.4Notes
    •2010/12/15:Wedonot(currently)support(properly)
    SELECTSUM(A)FROMFooGROUPBYB;
    sincethequeryrewriterdoesnotaddBtotherequested
    ThusthefinalmergedaggregationfailsbecausenoBexists
    •2010/11/30:Noticethataspatially-restrictedselection
    scan.Thisimpliesthatittakesabout25minutestoscan
    indexesonRAandDecl,butwithUDFsperformingspatial
    beenunabletoleveragethem.Thisarguesthatweshould
    –Approach1:subChunkIdclauseinjection–subChunkIdISindexed,soweshould
    addsomethinglike“subChunkIdIN(1,2,3,5,25,100)”
    drasticallyimproveperformance,butismorecomplex.
    –Approach2:Addra/declindexes–Thismightbedesired
    addra/declclausesthatMySQLunderstandssothat
    3150-nodeScalabilityTest
    3.1Hardware
    Weconfiguredaclusterof150nodesinterconnectedviagigabit
    quad-coreIntelXeonX5355processorswith16GBmemoryand
    disk.TestswereconductedwithQservSVNr21589,MySQL5.1.45
    patches.
    3.2Data
    Wetestedusingadatasetsynthesizedbyspatiallyreplicating
    challenge(“PT1.1”,Document-26217).Weusedtwotables:ObjectandSource.
    2Thesetwo
    2Theschemamaybebrowsedonlineat
    https://ls.st/8g4
    18

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    tablesareamongthelargestexpectedinLSST.Ofthesetwo,
    bethemostfrequentlyused.TheSourcetablewillhave50-200X
    anditsuseisprimarilyconfinedtotimeseriesanalyses
    Objecttable.
    ThePT1.1datasetcoversasphericalpatchwithright-ascension
    declinationbetween-7˚and7˚.Thispatchwastreatedas
    catedovertheskybytransformingduplicaterows’RAand
    tomaintainspatialdistanceanddensitybyanon-linear
    asafunctionofdeclination.ThisresultedinanObject
    Sourcetableof55billionrows(30TB).
    3TheSourcetableincludedonlydata
    and+54˚indeclination.Thepolarportionswereclipped
    testcluster.Partitioningwassetfor85stripeseachwith??heightof
    ~2.11˚forstripesand0.176˚forsub-stripes.Eachchunk
    mately4.5deg
    2,andeachsub-chunk,0.031deg2.Thisyielded8,983chunks.Overlap
    to0.01667˚(1arc-minute).
    3.3Queries
    ThecurrentQservdevelopmentfocusisonfeaturesforscalability.
    oftestqueriesthatdemonstrateperformanceforbothcheap
    andexpensivequeries(hour,daylatency).Runsoflowvolume
    queries,whilerunsofhighvolumequeriesandsuperhigh
    afeworevenonequeryduetotheirexpense.Allreported
    command-lineMySQLclient(MySQL).
    3.3.1Low-volume1–objectretrieval
    SELECT*FROMObjectWHEREobjectId=<objId>
    InFigure1wecanseethatperformanceofthisqueryisroughly
    seconds.Eachrunconsistedof20queries.Theslowerperformance
    eachexecutiontook9seconds,wereprobablytheresult
    Weattributetheinitial8secondexecutiontimeinRun5and
    3Sourcefortheduplicatorisavailableat
    https://github.com/lsst/qserv/blob/master/admin/python/lsst/
    qserv/admin/dataDuplicator.py
    19

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    (likelytheobjectIdindex)inthecluster.
    Figure1:Low-volumeobjectretrival.
    3.3.2Low-volume2–timeseries
    SELECTtaiMidPoint,fluxToAbMag(psfFlux),fluxToA
    FROMSource
    WHEREobjectId=<objId>
    Thisqueryretrievesinformationfromalldetectionsof
    tivelyprovidingatime-seriesofmeasurementsonadesired
    wasrandomizedasfortheLowVolume1query,whichmeant
    wheretheSourcedatawasmissingduetoavailablespace
    InFigure2weseethatperformanceisroughlyconstantatabout
    wasdoneafterLowVolume1’sRun1andwediscountits9second
    asanomalous.
    20

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    Figure2:Low-volumetimeseries.
    3.3.3Low-volume3–spatially-restrictedfilter
    SELECTCOUNT(*)
    FROMObject
    WHEREra_PSBETWEEN1AND2
    ANDdecl_PSBETWEEN3AND4
    ANDfluxToAbMag(zFlux_PS)BETWEEN21AND21.5
    ANDfluxToAbMag(gFlux_PS)−fluxToAbMag(rFlux_PS)BETWEEN0.A3ND0.4
    ANDfluxToAbMag(iFlux_PS)−fluxToAbMag(zFlux_PS)BETWEEN0.A1ND0.12;
    InFigure3weseethesame4secondperformancethatwasseenfor
    queries.Again,the~9secondperformanceinRun2couldnot
    itasresultingfromcompetingprocessesonthecluster.
    21

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    Figure3:Low-volumespatially-restrictedfilter.
    3.3.4Highvolume1–count
    SELECTCOUNT(*F)ROMObject
    Figure4:Highvolumecount.
    3.3.5High-volume2–full-skyfilter
    SELECTobjectId,ra_PS,decl_PS,uFlux_PS,gFlux_P
    rFlux_PS,iFlux_PS,zFlux_PS,yFlux_PS
    FROMObject
    WHEREfluxToAbMag(iFlux_PS)−fluxToAbMag(zFlux_PS)>4
    22

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    Usingtheon-diskdatafootprint(MySQL’sMyISAM.MYD,
    theObjecttable(1.824x10
    12bytes),wecancomputetheaggregateeffective
    bandwidth.Run3’s7minuteexecutionyields4.0GB/sin
    whiletheotherrunsyieldapproximately11GB/sinaggregate,
    eachnodewasconfiguredtoexecuteupto4queriesinparallel,
    realistic,givenseekactivityfromcompetingqueries
    theoreticaltransferrateof98MB/s.
    Figure5:Highvolumefull-skyfilter.
    3.3.6High-volume3–density
    SELECTCOUNT(*A)Sn,AVG(ra_PS),AVG(decl_PS),chunkId
    FROMObject
    GROUPBYchunkId
    23

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    Figure6:Highvolumefull-skyfilter.
    Thisquerycomputesstatisticsfortablefragments(which
    givingaroughestimateofobjectdensityoverthesky.It
    tionquerysupportinQserv.Thisqueryisofsimilarcomplexity6
    illustratesmeasuredtimessignificantlyfaster,whichis
    missiontime.AsmentionedforHV2,cachebehaviorwasnot
    inRun3maybeclose.
    3.3.7Super-high-volume1–nearneighbor
    SELECTCOUNT(*)
    FROMObjecto1,Objecto2
    WHEREqserv_areaspec_box(−5−,5,5,−5)
    ANDqserv_angSep(o1.ra_PS,o1.decl_PS,
    o2.ra_PS,o2.decl_PS)<0.1
    Thisqueryfindspairsofobjectswithinaspecifiedspherical
    ticularpartofthesky.Overtworandomlyselected100deg
    2areas,theexecutiontimes
    about10minutes(667.19secondsand660.25seconds).The
    between3to5billion.Sinceexecutionuseson-the-flygenerated
    fitinmemory,andQservdoesnotyetimplementcaching,we
    negligible.
    3.3.8Super-high-volume2–sourcesnotnearobjects
    24

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    SELECTo.objectId,s.sourceId,s.ra,s.decl,
    o.ra_PS,o.decl_PS
    FROMObjecto,Sources
    WHEREqserv_areaspec_box(224.1,−7.5,237.1,5.5)
    ANDo.objectId=s.objectId
    ANDqserv_angSep(s.ra,s.decl,o.ra_PS,o.decl_P
    Thisisanexpensivequery–an??(????)joinover150squaredegreesbetweena
    30TBtable.EachobjectIdisuniqueinObject,butisshared
    so??∼41.Werecordedtimesofafewhours(5:20:38.00,2:06:56.33,
    varianceispresumedtobecausedbyvaryingspatialobject
    areasselected.
    3.4Scaling
    WetestedQserv’sscalabilitybymeasuringitsperformance
    inthecluster.Tosimulatedifferentclustersizes,thefrontend
    queriesforpartitionsbelongingtothedesiredsetofcluster
    datasizeproportionallywithoutchangingthedatasize
    performanceat40,100,and150nodestodemonstrateweak
    3.4.1Scalingwithsmallqueries
    FromFigures7,8,and9,weseethatexecutiontimeisunaffectedby
    thedatapernodeisconstant.Thespikeinthe40-nodeconfiguration9iscausedby2slow
    queries(23sand57s);theother28executedintimesranging
    25

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    Figure7:Scalingwithnodecount(1).
    Figure8:Scalingwithnodecount(2).
    Figure9:Scalingwithnodecount(3).
    3.4.2Scalingwithexpensivequeries
    HighVolume
    26

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    IfQservscaledperfectlylinearly,theexecutiontime
    nodeisconstant.InFigure10thetimesforhighvolumequeriesshowaslight
    isaprimarilyatestofdispatchandresultcollectionoverhead
    withthenumberofchunkssincethefront-endhasafixedamount
    Sincewevariedthesetofchunksinordertovarythecluster
    shouldthusvarylinearlywithclustersize.HV3seemsto
    cacheeffects–itsresultwascachedsoexecutionbecame
    TheHighVolume2queryapproximatelyexhibitstheflatbehavior
    scalability.Cachingeffectsmayhavecloudedtheresults,
    queryresultswereperfectlycached,weexpecttheoverall
    byoverheadasinHV1,andthisisclearlynotthecase.
    Figure10:Scalingwithhighvolumequeries.
    SuperHighVolume
    Thetestsonexpensivequeriesdidnotshowperfectscalability,
    surementsdidshowsomeamountofparallelism.Itisunclear
    configurationwastheslowestforbothSHV1andSHV2.Ourtime-limited
    didnotallowustorepeatexecutionsoftheseexpensivequeries
    inbetterdetail.
    27

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    Figure11:Scalingwithsuperhighvolumequeries.
    3.5Concurrency
    Figure12:Concurrencytest.
    WewereabletotestQservwithmultiplequeriesinflight.
    parallelinvocationsofHV2,oneofLV1,andoneofLV2.Each
    1secondbetweenqueries.Figure12illustratesconcurr
    HV2queriestakeabouttwicethetime(5:53.75and5:53.71)
    Thismakessensesinceeachisafulltablescanthatiscompeting
    scanninghasnotbeenimplemented.Thefirstqueriesinthe
    inabout30seconds,buteachoftheirsecondqueriesseems
    queriesinthestreamsfinishfaster.Sincetheworkernodes
    forqueriesanddonotimplementanyconceptofquerycost,
    system.Theslownessoflowvolumequeriesafterthesecond
    28

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    glance,sincetheyshouldbequeuedattheendontheirassigned
    completeneartheendoftheHV2queries.Inthatcase,subsequent
    workerswithnearlyemptyqueuesandexecuteimmediately.
    byqueryskew–shortqueriesmaylandonworkersthathave
    onthehighvolumequeries.
    3.6Discussion
    3.6.1Latency
    LSST’sdataaccessneedsincludesupportingbothsmall,
    longer,hour/day-scalequeries.WedesignedQservtooperate
    avoidneedingmultiplesystems,whichwouldbecostlyin
    hardware.Indexingwasimplementedinordertoreducelatency
    touchasmallpartofthedata.
    ThecurrentQservimplementationincurssignificantoverhead
    lectingresults.Inearlydevelopmentwedecidedtominimize
    sothefront-endmasterbecameresponsibleforpreparing
    didnotneedtoperformparsingorvariablesubstitution.
    heavyweightaswell.MySQLdoesnotprovideamethodto
    instances,sotablesaredumpedtoSQLstatementsusingmysqldumpandreloadedonthe
    front-end.Thismethodwaschosentospeedprototyping,
    work,anddatabasetransactionsarestrongmotivations
    3.6.2Solid-statestorage
    SomeofQserv’sdesignchoices(e.g.,sharedscanning)
    aroundpoorseekperformancecharacteristicsofdisks.
    apracticalalternativetomechanicaldiskinmanyapplications.
    indexes,itscurrentcostdifferentialperunitcapacity
    bulkdata.Inthecaseofflashstorage,themostpopularsolid-state
    scanningisstilleffectiveinoptimizingperformancesince
    storageandflashstillhas“seek”penaltycharacteristics
    disk).
    29

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    3.6.3Manycore
    WeexpecttheperformancetobeI/Oconstrained,sincethe
    formancelimited.Itisunlikelythatmanycorescanbeleveraged
    willbesizedwithonlythenumberofdiskspindlesthatsaturate
    scanningshouldincreaseCPUutilizationefficiency.
    3.6.4Alternatepartitioning
    Therectangularfragmentationinrightascensionanddeclination,
    alizephysicallyforhumans,isproblematicduetosevere
    exploringtheuseofahierarchicalscheme,suchasthehierarchical6]for
    partitioningandspatialindexing.Theseschemescanproduce
    area,andmapsphericalpointstointegeridentifiersencoding
    subdivisionlevels.Interactivequerieswithverysmall
    operateoverasmallsetoffinepartitionIDs.Ifchunksare
    mayallowI/Otooccuratbelowsub-chunkgranularitywithout
    otherbonusisthatmature,welltested,andhigh-performance
    computingthepartitionIDsofpointsandmappingspherical
    3.6.5Distributedmanagement
    TheQservsystemisimplementedasasinglemasterwithmany
    reasonableandhasperformedadequatelyintesting,but
    instanceatLSST’splannedscalemayhaveamillionfragment
    haveplanstooptimizethequerymanagementcodepath,managing
    pointislikelytobeproblematic.Thetestdatasetdescribed
    about9,000chunks,whichmeansthatalaunchofeventhemost
    about9,000chunkqueries.
    Onewaytodistributethemanagementloadistolaunchmultiple
    issimpleandrequiresnocodechangesotherthansomelogic
    balancebetweendifferentQservmasters.Anotherwayis
    management.Insteadofmanagingindividualchunkqueries,
    groupsofthemtolower-levelmasterswhichwouldcouldeither
    30

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    groupsormanagetheindividualchunkqueriesthemselves.
    4100-TBScalabilityTest(JHU20-nodecluster)
    Inthefallof2012,wewereprovidedtheopportunitytouse
    HopkinsUniversity(JHU),thatwerelargememory,multi-processor
    largestorageattached,tosetupasaQservservice.There
    thenodeshadtwoprocessorswith12coreseachofIntelXeon
    mountedasadatavolume,were22TBraidarrays,toprovide
    Thiswouldprovideahighvolumestoragetest,butwitha
    withonemasternode,and20workernodes.
    Thedatausedwouldbeproducedoneachnode,startingwith
    testdatasetwas220GB,buthighdensitydatainoneparticular
    wide.Wechoseahighdensityspotofthisdata,andthen
    sphere,toprovideahighdensitydatasetoverthewhole
    data.
    Theproductionofthislargeamountofdataprovedtohave
    ratherslow,takingmanyweeksforafullproductiononeach
    weresetup,andthestabilityoftheclusterwasanissue,
    running,andmanysmallerproductionprocessesweresetup
    producedover100TBofcsvtextfiles,intoabout7000chunks
    done,thenthisdatawasloadedintoMySQLMyISAMtables,
    alsotookdays.
    Overthecourseofthistime,oftennodeswouldgooutwith
    amountoftime,beforecomingback.Oftenthiswouldbewith
    volumes,butthelossofcomputingnodeswouldsetbackthe
    complete.Butthiswasasmallproblem,thantheproblemof
    wereupandrunning,thedataservicewasstillhavingproblems.
    lossorcorruptionofafewofthethousandsoftablesonthe
    eitherdatawouldbere-createdorjustblockedfortesting.
    withdatacorruptionwasconstantissue,butstillalarge
    least.Alsoaproblemwasfilecorruptionontheinstallsoftware,
    31

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    bereinstalledoverthecourseofthetesting.Afull100TB
    about85TBcouldeverbeserved,beforetheresourceshad
    Butsometestingwasabletogetdoneonthiscluster.Afull
    Objectdata,althoughthiswasonlyontheaprox,85TBof
    100TBofdatathatwasgenerated.Thequerywasperformed:
    SELECTcount(*F)ROMObject
    +------------+
    |count(*)|
    +------------+
    |2059335968|
    +------------+
    1rowinset(19.26sec)
    Showing2Bobjectsinthetable.Theresultherewasafter
    times,andthecachehadbeenstabilized.Thisissimilar
    testing.
    Thelowvolumetestfromaccesstoasmallportionofthesky
    query:
    SELECTcount(*)
    FROMObject
    WHEREqserv_areaspec_box(1,3,2,4)
    ANDscisql_fluxToAbMag(zFlux_PS)BETWEEN21AND21.5
    +----------+
    |count(*)|
    +----------+
    |748|
    +----------+
    1rowinset(4.45sec)
    32

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    Thistimeforaccessisalsosimilartoprevioustesting,
    smallpartoftheskyofacertaincolor.The~4.5sec.overhead
    fordataaccessinthisversionoftheqservsoftware.
    Ahighvolumedatatestwasperformed,lookingforcolor
    certainrangeofcolor.Thiswillscanoverallobjects,to
    testquerywas:
    SELECTobjectId,ra_PS,decl_PS,uFlux_PS,gFlux_P
    rFlux_PS,iFlux_PS,zFlux_PS,yFlux_PS
    FROMObject
    WHEREscisql_fluxToAbMag(iFlux_PS)−
    scisql_fluxToAbMag(zFlux_PS)>4
    Thisqueryreturned15695recordsin6min33.50sec.Again
    beroftimes,andthistimeistheaveragetimeafterthecaches
    performedagain,thistimelookingatalowernumberofrecords,
    betweeniandzfluxof5thistime.Thisqueryreturned2967records,
    timewasalittlelowerthistime,whichwasmostlythetime
    wheretherestofthetimewastheover-headinscanningthe
    theserecords.Theprevioustestsweredoneon30TBofdata,
    thesenodeshadmanylesscores.Butthistestwouldreturn
    hereitisabout375sec.Theextratimeherewillcomefrom
    datapernode,andamountofdataingeneral,andtheaccess
    5ConcurrencyTests(SLAC100,000chunk-queries)
    ApreviousversionoftheQservmastercodehaddedicated
    eachquery,onefordispatchingchunkqueries,andtheother
    dispatchpoolwassizedataquitehigh500threadsfortwo
    dispatchworkasquicklyaspossible,allowingtheQserv
    best.Secondly,thefirstquerydispatchagainstachunk
    latencyonafulltablescanof~10,000chunkstakesapproximately
    patchthreads.Subsequently,XRootDcachingallowsfor
    comparison.
    33

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    Thethreadpoolforresultreadswasgivenamuchsmallersize:
    isbecausetheQservmasterprocesscanonlyexploitlimited
    mergingworkerresultsforasinglequery.Infact,themain
    as20threadsisthatitreducesresultmergelatencywhen
    henceresultavailabilitytimes)areskewed.
    Anunfortunateconsequenceofthissimpledesignwasthat
    querieswouldcausethreadcreationfailuresintheQserv
    changedtoaunifiedquerydispatchandresultreaderthread
    Totestourabilitytohandlemanyconcurrentfull-table
    threads,wepartitionedthePT1.2[Document-9044]Objecttableinto~8,000chunks,
    tributedthemacrossan80nodeclusteratSLAC.Thenodes
    hadlimitedquantitiesofRAM,makingthemtheperfectworkhorses
    particular,askingformorethan~1000threadswouldcause
    form.Usingthenewunifiedthread-pooldesignwewereable
    and12concurrentObjecttablescanseachinvolving~8,000
    executiontimeof2to8minutes,thusdemonstratingthat
    of~100,000in-flightchunkqueries,evenonveryoldhardware.
    Notethatusingaunifiedthreadpoolforresultreadsrequires
    starvation.Asinglequerycaneasilyrequire10,000result
    totalthreadsinthepool.Asaresult,wemustbecareful
    singlequery,orqueriesthatshouldbeinteractivecaneasily
    astheywaitforatablescantofinish.Ourunifiedthread
    assignsavailablethreadstothequeryusingthefewest
    createnewthreadswhenanewqueryisencountered(upto
    Totestthis,wesetupQservwithasinglemasternodeand
    wasconfiguredwith~12,000emptychunktables.Wethensubmitted
    (SELECTCOUNT(*)FROMObject),andaninteractivequery(SELECTCOUNT(*)FROMObject
    objectIdIN(1))requiringjustasinglechunkquerytoanswer.
    demonstratethatthemasterimmediatelyallocatedthreads
    chunkqueryandreadbackitsresults,theexecutiontime
    higherthanitshouldhavebeen.Itturnsoutthisisbecause
    chunkqueryschedulingpolicy,andthesinglechunkquery
    34

    LARGESYNOPTICSURVEYTELESCOPE
    EarlyQservTestsDMTR-21LatestRevision2013-08-02
    userquerywasbeingqueuedupbehindamultitudeofchunk
    ontheworkerside.Wearecurrentlyworkingtoaddressthis
    workonsharedscans.
    References
    [1][DMTR-12],Becla,J.,2013,Qserv300nodetest,DMTR-12,URLhttps://ls.st/DMTR-
    [2][LDM-135],Becla,J.,Wang,D.,Monkewitz,S.,etal.,2013,DatabaseDesign,LDM-135,URL
    https://ls.st/LDM-135
    [3][LDM-144],Freemon,M.,Pietrowicz,S.,Alt,J.,2016,SiteSpecificInfrastructure
    Model,LDM-144,URLhttps://ls.st/LDM-144
    [4][Document-26217],Kantor,J.,2010,DataChallenge3bPerformanceTest,Document-
    26217,URLhttps://ls.st/Document-26217
    [5][Document-9044],Kantor,J.,Axelrod,T.,Allsman,R.,Freemon,Data
    Challenge3bOverview,Document-9044,URLhttps://ls.st/Document-9044
    [6]Kunszt,P.Z.,Szalay,A.S.,Thakar,A.R.,2001,In:Banday,
    (eds.)MiningtheSky,631,doi:10.1007/10849171_83,ADSLink
    35

    Back to top