Distributed key value store for large scale systems

DOCUMENT INFORMATION

Basic information

Title: Distributed Key Value Store for Large Scale Systems
Author: Thanh Trung Nguyen
Supervisor: Assoc. Prof. Dr. Hieu Minh Nguyen
University: Le Quy Don Technical University
Major: Information Technology
Document type: Thesis
Year of publication: 2016
City: Ha Noi
Number of pages: 234
File size: 2.68 MB

Structure

  • 1.1 Key-Value Store Overview (19)
  • 1.2 Big Data Challenges and Motivation (25)
  • 1.3 MCS: Data Storage Framework (30)
    • 1.3.1 Memory Cache (32)
    • 1.3.2 Key-Value Store Abstraction (34)
    • 1.3.3 Service Model (35)
    • 1.3.4 Commit-Log (35)
  • 1.4 Problem Statements (35)
  • 1.5 Contributions (39)
  • 1.6 Thesis Structure (41)
  • 2.1 Overview (43)
    • 2.1.1 The Development of NoSQL (43)
    • 2.1.2 Scalable Data Management for Cloud Computing and Big Data (50)
  • 2.2 Basic Concepts (50)
    • 2.2.1 ACID Properties (51)
    • 2.2.2 Consistency, Availability and Partition Tolerance (52)
    • 2.2.3 Eventual Consistency (53)
    • 2.2.4 The BASE Consistency Model (63)
    • 2.2.5 Partitioning (65)
    • 2.2.6 Data Structures for the Persistent Layer in Key-Value Stores (67)
  • 2.3 Cloud Storage Benchmarks and Workloads (70)
  • 3.1 Introduction (75)
  • 3.2 Related Works (79)
  • 3.3 Sequential Log Storage Model (84)
  • 3.4 Proposed Key-Value Store (86)
    • 3.4.1 Data Structure for the Index (87)
    • 3.4.2 Data Layout in Persistent File (91)
    • 3.4.3 Key-Value Data Table and Main Algorithms (91)
    • 3.4.4 Implementation (104)
  • 3.5 Analysis and Comparison with Other Key-Value Stores (105)
  • 3.6 Performance Evaluation (109)
    • 3.6.1 Standard Benchmark (109)
    • 3.6.2 Engine Evaluation (114)
    • 3.6.3 Discussion (117)
  • 3.7 Summary (118)
  • 4.1 Introduction (123)
  • 4.2 Related Works (126)
  • 4.3 Proposed Method for Big File Storage System (129)
    • 4.3.1 General Big File Model (130)
    • 4.3.2 Architecture Overview (130)
    • 4.3.3 Logical Data Layout (133)
    • 4.3.4 Chunk Storage (134)
    • 4.3.5 Metadata (135)
    • 4.3.6 Data Distribution and Replication (140)
    • 4.3.7 Uploading and Deduplication Algorithm (140)
    • 4.3.8 Downloading Algorithm (145)
    • 4.3.9 Secure Data Transfer Protocol (146)
  • 4.4 Evaluation (148)
    • 4.4.1 Evaluate Key-Value Store for BFC (150)
    • 4.4.2 Local Performance Comparison (151)
    • 4.4.3 Metadata Comparison (151)
    • 4.4.4 Deduplication (152)
  • 4.5 Summary (154)
  • 5.1 Introduction (158)
  • 5.2 Big Set Problem Statement (161)
    • 5.2.1 Problem (161)
    • 5.2.2 Complexity (161)
  • 5.3 Related Works (163)
  • 5.4 Forest of Distributed B+Tree for Solving the Big-Set Problem (167)
    • 5.4.1 Method Overview (169)
    • 5.4.2 Forest of Distributed B+Tree Definition (171)
    • 5.4.3 Leaf Nodes of Distributed B+Tree (171)
    • 5.4.4 Non-leaf Nodes of Distributed B+Trees (175)
    • 5.4.5 Forest of Distributed B+Tree (184)
    • 5.4.6 General Key-Value Store Using Forest of Distributed B+Tree (190)
  • 5.5 Evaluation (192)
  • 5.6 Discussion (196)
  • 5.7 Applications of ZDB and Forest of Distributed B+Tree (199)
    • 5.7.1 Computing Architecture for Anomaly Detection System (199)
    • 5.7.2 Specific Storage Solution for Specific Structured Data (202)
  • 5.8 Summary (206)
  • 2.1 Consistent Hashing
  • 2.2 B-Tree Example
  • 3.1 Proposed key-value store architecture
  • 3.2 Data file layout with 2 data files
  • 3.3 Put, Get, Remove algorithms of Flat Table
  • 3.4 Data partitioning
  • 3.5 Write only 1KB records using YCSB
  • 3.6 Write only 4KB records using YCSB
  • 3.7 High read/low write 1KB records using YCSB
  • 3.8 High read/low write 4KB records using YCSB
  • 3.9 High write/low read 1KB records using YCSB
  • 3.10 High write/low read 4KB records using YCSB
  • 4.1 BFC Architecture
  • 4.2 BFC Main Backend Components
  • 4.3 Data layout of Big File in the system
  • 4.5 Metadata storage system
  • 4.6 Uploading Algorithm of Application
  • 4.7 Downloading Algorithm of Application
  • 4.8 Throughput of BFC when using different key-value stores
  • 4.9 Latency of Reading
  • 4.10 Latency of Writing
  • 4.11 Latency of concurrent R/W
  • 4.12 Metadata comparison of BFC and DropBox
  • 5.1 Big-set components overview
  • 5.2 Big-set sample data for key-value items, each node has at most 4 children
  • 5.3 Split root node and level up
  • 5.4 Split and create new brother metadata node
  • 5.5 Split small set example
  • 5.6 Write only throughput
  • 5.7 Read only throughput
  • 5.8 Mixed Workload R
  • 5.9 Mixed Workload W
  • 5.10 Capacity comparison
  • 5.11 Computing System Architecture
  • 1.1 Factors affecting performance metrics in [22]
  • 2.1 CAP-Theorem options for systems
  • 3.1 Workloads parameters
  • 3.2 One Writing Thread
  • 3.3 Four Writing Threads
  • 3.4 Random Reading
  • 4.1 Deduplication Comparison
  • 5.1 Other capability comparison

Content

Key-Value Store Overview

Key-Value store (KVS) is a category of NoSQL data store. A KVS often has a simple data model that maps from keys to values. In a mathematical view, a KVS can be defined as follows:

Definition 1.1 A general key-value store model can be demonstrated as a tuple {K, V, S, F}, where:

• K is the key space and V is the value space.

• S is the state set of the key-value store; each state consists of key-value pairs [{k_i, v_i}] where every k_i is distinct. At a given time the key-value store has one state, and its state is changed when there is a write operation on it.

• F: K × S → V is the mapping function from key space to value space. Function F depends on the current state of the key-value store.

A key-value store can be treated as a simple relation or a table of two columns, key and value, where key is the primary key and each key is mapped to an associated value. A KVS has operations to read and write data:

• put(key, value) - add a key-value pair into the KVS, or update the value associated with a key;

• get(key) - retrieve the value associated with a key;

• remove(key) - delete the key-value pair associated with a key.

Among these operations, put and remove are used for writing and affect the state S and the mapping function F, while get is the reading operation that uses the mapping function F.
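The put/get/remove interface above can be sketched as a minimal in-memory key-value store. This is an illustrative sketch only (a plain dictionary stands in for the state S), not the implementation proposed in this thesis:

```python
class KeyValueStore:
    """Minimal key-value store: put/remove change the state S,
    get evaluates the mapping function F on the current state."""

    def __init__(self):
        self._state = {}  # current state S: a set of {key: value} pairs

    def put(self, key, value):
        # add a key-value pair, or update the value associated with the key
        self._state[key] = value

    def get(self, key):
        # read operation: apply the mapping function F(key, S) -> value
        return self._state.get(key)

    def remove(self, key):
        # delete the pair associated with the key, if present
        self._state.pop(key, None)
```

For example, after put(1, "a") a get(1) returns "a"; after remove(1) the key no longer maps to any value.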

To evaluate the performance of key-value stores, we can reuse some popular storage metrics [18]. The first metric is response time or latency of an operation: the time period needed to complete an operation. The second metric is throughput, which can be measured in operations per second (ops) or bytes per second, etc. Many factors affect the performance metrics of key-value stores: in persistent storage, data are stored in devices such as SSD or HDD, and the number of disk seeks in operations strongly affects the response time. The complexity of the algorithms in the key-value store is also an important factor. Table 1.1 below shows factors that affect performance metrics.

In a data store system, we can model the latency of an operation as follows:

    latency = (networkIOTime + processTime + filesystemIOTime) × c_l    (1.1)

where:

    networkIOTime = (dataTranferedSize / networkSpeed) × c_n    (1.2)

    filesystemIOTime = (n_seek × t_seek + t_fs × c_fs) × c_f    (1.3)

c_l, c_n, c_f are constants; processTime depends on the complexity of the in-memory data structure.

Table 1.1 (excerpt) - factors affecting performance metrics:

    Mutex lock/unlock: 100 ns (t_lock, c_lock)
    File system I/O: 30,000,000 ns (HDD), 1,000,000 ns (SSD) (t_fs, c_fs)
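As a small illustration of Equations 1.1-1.3, the sketch below plugs numbers into the latency model. All constants (c_l, c_n, c_f) and timing values here are illustrative placeholders, not measurements from this thesis; the point is only that one extra disk seek dominates the estimated latency:

```python
def network_io_time(data_size_bytes, network_speed_bps, c_n=1.0):
    # Equation 1.2: networkIOTime = (dataTranferedSize / networkSpeed) * c_n
    return data_size_bytes / network_speed_bps * c_n

def filesystem_io_time(n_seek, t_seek, t_fs, c_fs, c_f=1.0):
    # Equation 1.3: filesystemIOTime = (n_seek * t_seek + t_fs * c_fs) * c_f
    return (n_seek * t_seek + t_fs * c_fs) * c_f

def latency(net_io, process_time, fs_io, c_l=1.0):
    # Equation 1.1: latency = (networkIOTime + processTime + filesystemIOTime) * c_l
    return (net_io + process_time + fs_io) * c_l

HDD_SEEK_NS = 10_000_000  # placeholder per-seek cost, not from Table 1.1

# 1 KB transferred over a 1 Gbps link (125,000,000 bytes/s), converted to ns
net_ns = network_io_time(1024, 125_000_000) * 1e9

# latency with one disk seek versus two disk seeks (processTime = 1,000 ns)
one_seek = latency(net_ns, 1_000, filesystem_io_time(1, HDD_SEEK_NS, 0, 0))
two_seeks = latency(net_ns, 1_000, filesystem_io_time(2, HDD_SEEK_NS, 0, 0))
```

With these placeholder numbers, the extra seek adds its full cost to the total latency, which is why the thesis focuses on minimizing n_seek per operation.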

Key-Value Store Taxonomy. There are many methods for classifying key-value stores; each method classifies by a criterion: data model, scalability, or persistent design. In this thesis, key-value stores are categorized into these types:

• In-memory key-value store: all data of the key-value store are placed in main memory in a dictionary data structure such as a B-Tree, Hash Table, or Red-Black tree. When the system crashes, data are lost locally. Memcached [31] is a famous open-source in-memory key-value store.

• In-memory file-backed-up key-value store: all data are stored in main memory and are also backed up to a persistent device for crash recovery. Redis [72], RAMCloud [65, 71], etc. are examples.

• Persistent key-value store: all data are placed on HDD or SSD using a data structure such as a B-Tree, B+Tree, LSM-Tree, Hash Table, or a hybrid data structure.

Big Data Challenges and Motivation

Big Data has been an important and popular topic in recent years. Many researches emphasize three Big Data characteristics: Volume - indicates the big data size, Velocity - data production and processing speed, and Variety - data types. Later researches such as [7] present two more characteristics: Veracity - data reliability and trust, and Value - the worth gained from exploiting Big Data [32, 42].

The first characteristic is Data Volume. It measures the amount of data available to an organization, which does not necessarily have to own all of it as long as it can access it [42]. Large amounts of data can be generated, and applications have to store and analyze them to extract useful information. Large-scale systems such as social networks and e-commerce need to store big data volumes and query them frequently. It is a challenge to manage big data volumes efficiently. The second characteristic is Data Velocity. Data velocity measures the speed of data creation, streaming, and aggregation. eCommerce has rapidly increased the speed and richness of data used for different business transactions (for example, web-site clicks). Data velocity management is much more than a bandwidth issue [42]. It is a challenge for a data storage system to store data with high writing speed and high read pressure. The third characteristic is Data Variety. Data variety is a measure of the richness of the data representation - text, images, video, audio, etc. From an analytic perspective, it is probably the biggest obstacle to effectively using large volumes of data. Incompatible data formats, non-aligned data structures, and inconsistent data semantics represent significant challenges that can lead to analytic sprawl [42].

In a report of a meeting of a group of database researchers [1], many challenges in Big Data are pointed out. These challenges come from three major accelerating trends:

• Cheaper data generation producing various kinds of data, due to inexpensive storage, sensors, smart devices, the emerging Internet of Things, etc.

• Cheaper processing of large amounts of data, due to the advantages of multicore CPUs, SSDs, open-source software, and cheaper cloud computing.

• Many types of people involved in the process of generating, processing, and consuming data: decision makers, domain scientists, application users, crowd workers, etc., not only the developers and administrators of traditional enterprises.

The report states that Big Data has now become a defining challenge of our time. It consists of five research areas [1], including:

• End-to-End Processing and Understanding of Data

• Roles of Humans in the Data Life Cycle

The volume, velocity, and variety aspects of Big Data are dealt with by the first three research areas, while Big Data applications in the cloud and managing the involvement of people in these applications are dealt with in the remaining two research areas.

In the first research area, the specific challenge is: how to build high performance and scalable systems to manage bigger data sets that arrive at increasing speed, leveraging new and improved hardware technologies such as new processors, storage, and networking. In [1], the authors stated that handling high rates of data capture and updates for schema-less data has led to the development of NoSQL systems. NoSQL systems, especially key-value stores, play important roles in building Scalable Big/Fast Data Infrastructures [26, 21, 1].

The second research area presents that it is necessary to eliminate the "one size fits all" point of view in Big Data, due to a much wider and richer variety of data types, shapes, and sizes than traditional enterprise data. This leads to the emergence of multiple classes of systems, with each addressing a particular class of requirements. In this thesis and in our research group, we call it "specific problem, specific solution". Within this challenge, efficiently handling data sets that do not fit in main memory is an important problem.

In a paper entitled "Big data: the driver for innovation in databases" [21], Bin Cui et al. analyzed big data problems and presented applications as demand drivers. One of these demanding applications is scalable and elastic data management. Scalable and elastic data management has been a great challenge to the database research community for more than 20 years, and different distributed database systems were proposed to deal with large data sets. However, the scalability of these systems is still limited due to some common problems in distributed environments, such as synchronization costs and node failures [21], and due to performance in a single node. Most existing data store services of Cloud systems, such as BigTable and Cassandra, exploit different solutions to improve system scalability. The techniques adopted by such systems include: a simple data model, separated metadata and application data, and relaxed consistency requirements [21].

The performance of a key-value store in a single node is important to build a high performance data store. There are ten rules proposed in [78] for building scalable high performance data stores. These rules emphasize that it is important to optimize the store performance in each node. Consider an example in which a customer is choosing between two solutions, each offering linear scalability. If solution A offers node performance that is a factor of 20 better than solution B, the customer might require 50 hardware nodes with solution A, versus 1,000 nodes with solution B.

Therefore, reducing latency in persistent key-value stores is worthwhile and meaningful to save resources for large systems. The number of disk seeks in key-value operations strongly affects the latency metric in the persistent layer. This motivates us to propose a new key-value store with lower complexity. There are also problems in building Big File Cloud storage and big data structure storage. Linear metadata size makes it difficult to manage big files, and it is also difficult to store a large number of big data structures with a high update rate. Solving these problems is meaningful in both academia and industry.

MCS: Data Storage Framework

Memory Cache

Memory Cache is a dictionary data structure for mapping from keys to structured values. It is used for fast data access with low latency. The size of a Memory Cache is limited; it only stores frequently accessed items. When the number of key-value pairs in a Memory Cache reaches the configured limit, some items will be selected and evicted using a replacement algorithm. Each Memory Cache class has a replacement algorithm: LRU (least recently used) [61, 62], LFU (least frequently used), ARC (Adaptive Replacement Cache) [55, 56], etc. A Memory Cache with a specific cache replacement policy is configurable to be appropriate to the workload characteristics of each storage backend service. We implemented hash tables with various cache replacement algorithms for caching data, and an atomic visiting function to visit items in the cache for reading and/or writing. The Memory Cache provides the following functions:

• visit(key, visitor): atomically visit the item associated with the key if it exists in the cache. The visit function can be used to read or modify cache items, depending on the algorithms and requirements of the service implemented in the visitor object. The Memory Cache ensures that only one visitor visits a key at a time, so the application does not have to care about locking when using the visit function. This makes it useful and easy to implement mini-transactions in the visitor.

• put(key, value): put an item into the cache; it is a wrapping function over the visit function.

In this framework, caching functions are thread-safe. Lock-free algorithms can be used to implement some types of cache structures.
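A bounded LRU cache with an atomic visit function could be sketched as follows. This is an illustrative sketch, not the thesis implementation: a single lock per cache stands in for the framework's per-key guarantee, and the visitor's return value (a replacement value, or None for read-only visits) is an assumed convention:

```python
import threading
from collections import OrderedDict

class MemoryCache:
    """Bounded LRU cache with an atomic visit function (sketch)."""

    def __init__(self, capacity):
        self._items = OrderedDict()   # key -> value, in LRU order
        self._capacity = capacity
        self._lock = threading.Lock()

    def visit(self, key, visitor):
        # Atomically visit the item if it exists. The visitor may read the
        # value or return a replacement value (a mini-transaction); it runs
        # under the lock, so no other visitor touches the key concurrently.
        with self._lock:
            if key not in self._items:
                return False
            self._items.move_to_end(key)          # mark as recently used
            new_value = visitor(self._items[key])
            if new_value is not None:
                self._items[key] = new_value
            return True

    def put(self, key, value):
        # Insert or update under the same lock, evicting the least
        # recently used item when the configured limit is exceeded.
        with self._lock:
            self._items[key] = value
            self._items.move_to_end(key)
            if len(self._items) > self._capacity:
                self._items.popitem(last=False)   # evict LRU item
```

A visitor that returns a new value performs an atomic read-modify-write, e.g. cache.visit(key, lambda v: v + 1) increments a counter without external locking.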

Key-Value Store Abstraction

Key-Value Store Abstraction is the persistent layer of the framework. This layer stores binary key-value data to persistent devices or remote services. It is abstracted and supports multiple implementations of key-value engines and distributed key-value stores. This layer currently supports ZDB in Chapter 3, LevelDB [40], Kyoto Cabinet [46], and the Remote Key-Value service wrapped by ZDB Service described in Chapter 3. With this design, the framework can support any type of key-value store in the backend. Memory Cache and Key-Value Store Abstraction are combined to create the Storage component. This component supports serialization, saves modified data from the memory cache to the persistent layer, and warms up cache items when they are missed from the cache. The process of flushing modified key-value pairs from Memory Cache to Key-Value Store Abstraction is controlled by the Storage component. This process can be configured to be done synchronously or asynchronously.
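One possible shape of this layering is sketched below. The interface and class names are illustrative assumptions, not the framework's actual API; a dictionary backend stands in for a real engine such as ZDB or LevelDB, and flushing is done synchronously:

```python
from abc import ABC, abstractmethod

class KeyValueStoreAbstraction(ABC):
    """Persistent layer: any engine (local file store, remote service, ...)
    can be plugged in behind this interface."""
    @abstractmethod
    def get(self, key): ...
    @abstractmethod
    def put(self, key, value): ...

class DictBackend(KeyValueStoreAbstraction):
    # stand-in engine for the sketch; a real backend persists to disk
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

class Storage:
    """Combines a memory cache with a persistent backend: flushes modified
    items down (synchronously here) and warms up the cache on a miss."""
    def __init__(self, backend):
        self._backend = backend
        self._cache = {}  # a plain dict stands in for the Memory Cache

    def put(self, key, value):
        self._cache[key] = value
        self._backend.put(key, value)   # synchronous flush; could be async

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        value = self._backend.get(key)  # cache miss: read persistent layer
        if value is not None:
            self._cache[key] = value    # warm up the cache item
        return value
```

Swapping DictBackend for a different engine changes nothing above it, which is the point of abstracting the persistent layer.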

Service Model

Service Model is the core logical layer of the framework. It implements all algorithms and functions for reading, atomically manipulating, modifying, and removing model data. Model data in each service is defined depending on the application requirements, such as User or Commerce Transaction Info; it can be any type of data structure we want to store. The model uses the Storage component to read and write data items for its logical algorithms.

Commit-Log

Commit-Log is a component that sequentially logs all write operations to files. Every running operation with its parameters is serialized using the Thrift binary protocol [75] into binary data, which is then appended to commit-log files. Commit-log files are used to recover data after a system crash. All writing to a commit-log file is append-only, so it is very fast. This component is appropriate for in-memory store services.
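The append-only log and replay cycle can be sketched as follows. The thesis serializes operations with the Thrift binary protocol; this sketch substitutes JSON lines purely to stay self-contained, and the class and method names are illustrative:

```python
import json
import os
import tempfile

class CommitLog:
    """Append-only commit log sketch: serialize each write operation with
    its parameters, append it to the log file, and replay after a crash."""

    def __init__(self, path):
        self._path = path
        self._file = open(path, "a", buffering=1)  # line-buffered appends

    def log(self, op, *params):
        # serialize the operation name and parameters, then append one line
        record = {"op": op, "params": list(params)}
        self._file.write(json.dumps(record) + "\n")

    def replay(self, store):
        # recovery: re-apply every logged write, in order, to rebuild state
        with open(self._path) as f:
            for line in f:
                entry = json.loads(line)
                getattr(store, entry["op"])(*entry["params"])
```

Because entries are replayed in append order, a later put to the same key overwrites the earlier one, reproducing the final in-memory state.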

Problem Statements

A general key-value store can be viewed as a mapping from K → V. There are some conditions and scopes that make specific problems:

1. Every value v in V has a limited size: ∀v ∈ V, |v| ≤ C, where C is a constant. The problem is to minimize the complexity of every operation while ensuring data persistence and respecting the limitation of main memory capacity.

2. When a key-value pair is stored in the system, every value v in V is static and its size is unlimited: ∀v ∈ V, |v| → ∞, v is static. The problem is how to efficiently store, distribute, and retrieve a large number of big values.

3. ∀v ∈ V, |v| → ∞, v is dynamic. The problem is how to efficiently store, distribute, and retrieve big, growing structured values.

The first problem is how to minimize the number of disk seeks in every operation of a persistent key-value store while using only limited memory resources. It is necessary to solve this problem to build a high performance key-value store in each single machine/node in the context of a distributed data storage environment. To the best of our knowledge, it is difficult to store key-value pairs with big values in a key-value store, yet this is an important requirement for taking advantage of key-value stores in large systems such as big file cloud storage. Many existing file cloud storage systems store big files in key-value stores; however, the space complexity of their file metadata is linear in the file size. With a big file, the metadata size is also big, which leads to large network and storage overhead for metadata. Thus, the second problem is to build an architecture to store big files taking advantage of key-value stores. This can also be stated as a key-value store for static big values.

Moreover, existing key-value stores do not work well when persistently storing a large number of big data structures such as big sets and big dictionaries. The third problem is how to build an architecture and algorithms for storing big data structures with the advantages of a distributed key-value store.

Scope: In the first problem, we propose a new key-value store that is optimal for increasing integer key types. Many data tables in real systems have a primary key of increasing integer type; about seventy percent of data tables in large products in Viet Nam such as Zalo, Zing Me, and Games use an integer type as the primary key. In the second problem, we focus on proposing algorithms and methods for building big file cloud storage for static data, which can be widely applied in building content delivery networks (CDN), personal file storage, distributed file systems for analytics, etc.

Contributions

In this thesis, we try to solve the above problems by designing and implementing distributed key-value stores for large scale systems.

• Propose a new method and algorithms to build persistent key-value storage that minimize the number of disk seeks per operation, leading to reduced latency and response time. This is done to minimize filesystemIOTime in Equations 1.1 and 1.3.

• Propose algorithms and an architecture for efficiently storing big values and big files in distributed key-value storage. This minimizes networkIOTime and filesystemIOTime in Equations 1.1 and 1.2 and minimizes the space complexity of metadata per big file. The results of this contribution are presented in the third and the sixth papers in the publication list.

• Propose a key-value store architecture and algorithms for the case where the value is a big data structure. This is used for large numbers of big structures such as big-set and big-list on top of the proposed key-value store. It minimizes networkIOTime and filesystemIOTime in Equations 1.2 and 1.3 when updating big data structures in the key-value store. The proposed architecture has high performance and larger capacity than existing works. The results of this contribution are presented in the fourth paper and the fifth paper in the publication list.

Fig. 1.2: Contributions Overview

Thesis Structure

Chapter 1 introduces the main problems this thesis attempts to solve and their scope, and summarizes the contributions.

Chapter 2 presents background knowledge and the development of NoSQL. It introduces basic concepts of data storage systems and popular data structures used to build key-value stores.

Chapter 3 presents the first contribution of this thesis. It proposes a new architecture of key-value store optimized for auto-increasing integer keys. Data are distributed using consistent hashing, making it easier to scale out the system when data grow.

Chapter 4 presents the proposed method to store big values, such as big files, in a key-value store. It contributes algorithms and an architecture to store big files efficiently in a key-value store, taking advantage of the distributed key-value store proposed in Chapter 3.

Chapter 5 presents a "Forest of distributed B+Tree on key-value store". It proposes algorithms to store key-value pairs where the value is a large data structure such as a big set or dictionary.

This chapter presents the background of NoSQL databases, key-value stores, and important concepts used in this data storage research area. A comprehensive background study of the state-of-the-art systems for scalable data management is provided in this chapter. We further focus on a set of systems which are designed to handle update-heavy workloads for supporting internet-facing applications.

Overview

The Development of NoSQL

NoSQL is a term that describes "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable". It originally referred to "non SQL", "non relational" or "not only SQL" [6]. In a Computerworld report about the NoSQL meet-up organized by Johan Oskarsson in San Francisco in 2009 [47], many experiences were shared about building large-scale systems such as Web 2.0 sites without using an RDBMS. These are the main reasons for developing and using NoSQL data stores:

Avoidance of Unneeded Complexity. Relational databases provide a variety of features and strict data consistency. But this rich feature set and the ACID properties implemented by RDBMSs might be more than necessary for particular applications and use cases. For example, the Zing Me social network holds 3 copies of user session data; it is not necessary to undergo all the consistency checks of a relational database management system between replicas, nor is it necessary to persist the data. In this case, storing sessions in memory is fully sufficient.

High Throughput. Some NoSQL databases provide higher data throughput and consume resources more efficiently than traditional RDBMSs. For instance, the column-store Hypertable, which pursues Google's Bigtable approach, allows the local search engine Zvents to store one billion data cells per day [41]. To give another example, Google is able to process 20 petabytes a day stored in Bigtable via its MapReduce approach.

• Scale out data (e.g. 3 TB for the green badges feature at Digg, 50 GB for the inbox search at Facebook, or 2 PB in total at eBay)

Most NoSQL databases are designed to scale well in the horizontal direction and not to rely on highly available hardware. This is the main difference between NoSQL and relational database management systems: NoSQL servers can be added and removed (or crash) without system downtime, and doing so does not cause the same operational effort.

Avoidance of Expensive Object-Relational Mapping. Most NoSQL databases are designed to store simple data structures or objects as used in object-oriented programming languages. This eliminates expensive object-relational mapping, giving higher performance and a simpler data model. With a relational database, even though an application may have a low-complexity data structure, it still has to suffer the complexity of object-relational mapping from the RDBMS without any benefit.

Many theoretical works have studied the movement from RDBMS to NoSQL. In a widely adopted research paper, "The End of an Architectural Era" [81], the authors concluded that current Relational Database Management Systems excel at nothing while attempting to be a "one size fits all" solution. In that paper, the authors compared RDBMSs with "specialized engines in the data warehouse, stream processing, text, and scientific database markets", which outperform RDBMSs by 1-2 orders of magnitude as shown in their previous papers [79, 80]. They also compared RDBMSs with a prototype at M.I.T. named H-Store [43] in their home market of business data processing / online transaction processing (OLTP) using the TPC-C benchmark; H-Store beats RDBMSs by nearly two orders of magnitude.

As mentioned in [81], it is necessary to completely rewrite new specialized engines from scratch, in both the research community and among DBMS vendors. There are many reasons to develop this new trend of data storage. First, RDBMSs were architected about 30 years ago, when the characteristics of hardware, user requirements, and applications were different from those of today. Most popular RDBMSs inherited architectural characteristics of System R, such as log-based recovery. The architecture of System R was influenced by the hardware characteristics of the 1970s. Since then, processor speed, memory, and disk sizes have increased enormously and today do not limit programs in the way they did formerly. However, the bandwidth between hard disks and memory has not increased as fast as CPU speed, memory, and disk size. In this regard, the authors underline that although "there have been some extensions over the years, including support for compression, shared-disk architectures, bitmap indexes, support for user-defined data types and operators, etc.", no system has had a complete redesign since its inception.

Moreover, since the 1970s, when there was only business data processing, new markets and use cases have evolved. Examples of these new markets include "data warehouses, text management, and stream processing", which "have very different requirements than business data processing". In these new markets, NoSQL outperforms RDBMSs for building cheap, high performance, scalable systems. The authors go on to notice that user interfaces and usage models have also changed over the past decades, from terminals where "operators [were] inputting queries" to rich client and web applications today, where interactive transactions and direct SQL interfaces are rare. The authors have evidenced that the current architecture of RDBMSs is not even appropriate for business data processing. They designed a DBMS engine for OLTP called H-Store that is functionally equipped to run the TPC-C benchmark and does so 82 times faster than a popular commercial DBMS. Based on this evidence, they conclude that "there is no market where they are competitive. As such, they should be considered as legacy technology more than a quarter of a century in age, for which a complete redesign and re-architecting is the appropriate next step". These reasons motivate the development of the NoSQL database research trend.

Scalable Data Management for Cloud Computing and Big Data

The most successful paradigm of service-oriented computing is Cloud Computing; there is a revolution in computing infrastructure. The most popular models of Cloud Computing, such as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), have developed at a rapid rate. Database as a Service and Storage as a Service are extended concepts of cloud computing. This has seen a proliferation in the number of applications which leverage various cloud platforms, resulting in a tremendous increase in the scale of the data generated as well as consumed by such applications. Scalable database management systems (DBMSs) are needed both for update-intensive application workloads and for decision support systems.

Scalable and distributed data management has been the vision of the database research community for more than three decades. Much research has focused on designing scalable systems for both update-intensive workloads and ad-hoc analysis workloads [2].

Parallel databases grew beyond prototype systems to large commercial systems, but distributed database systems were not very successful and were never commercialized; rather, various ad-hoc approaches to scaling were used. Changes in the data access patterns of applications and the need to scale out to thousands of commodity machines led to the birth of a new class of systems referred to as Key-Value stores [26, 16], which are now being widely adopted by various enterprises.

Basic Concepts

ACID Properties

In relational database theory and systems, there is a set of properties called ACID that guarantees database transactions are processed reliably [37, 36]. The set consists of: Atomicity, Consistency, Isolation, and Durability.

Atomicity: it either happens or it does not; either all are bound by the contract or none are. Atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. To the outside world, a committed transaction appears (by its effects on the database) to be indivisible ("atomic"), and an aborted transaction does not happen.

Consistency: this property ensures that any transaction will bring the database from one valid state to a new valid state. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This does not guarantee correctness of the transaction in all ways the application programmer might have wanted (that is the responsibility of application-level code) but merely that any programming errors cannot result in the violation of any defined rules.

Isolation: this property ensures that the concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially, i.e. one after the other. Providing isolation is the main goal of concurrency control. Depending on the concurrency control method, the effects of an incomplete transaction might not even be visible to another transaction.

Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements executes, the results need to be stored permanently (even if the database crashes immediately thereafter). To defend against power loss, transactions (or their effects) must be recorded in a non-volatile memory.
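The atomicity property described above can be observed directly with Python's built-in sqlite3 module, whose connection object wraps a group of statements in a transaction that either commits as a whole or rolls back. The table and account names below are arbitrary examples for illustration, not part of any system discussed in this thesis:

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 100)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts atomically: both updates commit, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            # Simulate a mid-transaction failure between the two updates.
            if amount > 50:
                raise RuntimeError("simulated failure after the first update")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except RuntimeError:
        return False

# The failed transfer leaves the database in its previous valid state:
assert transfer(conn, "alice", "bob", 60) is False
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # -> {'alice': 100, 'bob': 100}
```

Even though the first UPDATE executed, the rollback restores the prior state, so an observer never sees the half-finished transfer — exactly the "all or nothing" guarantee.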

Consistency, Availability and Partition tolerance

At ACM’s PODC symposium in 2000, Eric Brewer came up with the so-called CAP theorem [13] in a keynote titled “Towards Robust Distributed Systems”, which is widely adopted today by large web companies (e.g. Amazon, cf. [88]) as well as in the NoSQL community. The CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency, Availability and Partition Tolerance.

Consistency: meaning if and how a system is in a consistent state after the execution of an operation. A distributed system is typically considered to be consistent if, after an update operation by some writer, all readers see the update in some shared data source. If data is replicated to multiple nodes, all nodes see the same data at the same time. This definition is different from the Consistency defined in ACID.

Availability: a system is designed and implemented in a way that allows it to continue operation (i.e. allowing read and write operations) if, for example, nodes in a cluster crash or some hardware or software parts are down due to upgrades. This guarantees that every request receives a response about whether it succeeded or failed.

Partition Tolerance: understood as the ability of the system to continue operation in the presence of network partitions. These occur if two or more network nodes (temporarily or permanently) cannot connect to each other. Partition tolerance is also the ability of a system to cope with the dynamic addition and removal of nodes without downtime. The system is still able to operate despite arbitrary message loss or failure of part of the system. “The easiest way to understand CAP is to think of two nodes on opposite sides of a partition. Allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C. Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A. Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P.” [12]. According to the CAP theorem, there are several options for a system to decide its features. This can be summarized in Table 2.1.

Table 2.1: Options for combining CAP guarantees, with example systems:

• Consistency + Availability: single-site databases, cluster databases, LDAP, xFS file system
• Consistency + Partition tolerance: …

In the mid-1990s, Eric Brewer and his colleagues had built a variety of cluster-based wide-area systems (essentially early cloud computing), including search engines, proxy caches, and content distribution systems [14]. In these products, system availability was a premium requirement, so they found themselves regularly choosing to optimize availability through strategies such as employing caches or logging updates for later reconciliation. Although these strategies did increase availability, the gain came at the cost of decreased consistency.

The CAP theorem is re-analyzed by its author in an article [12].

Eventual Consistency

In distributed computing, to achieve high availability, a consistency model called eventual consistency is used to informally guarantee that, if no new update is made to a given data item, all accesses to that item will eventually return the last updated value [88]. This model has origins in early mobile computing projects [84] and is widely applied in distributed systems. It is often used for building optimistic replication in data storage. A system is said to have converged, or achieved replica convergence, when it has achieved eventual consistency. Eventual consistency is analyzed carefully in a paper by Werner Vogels [88]. It is an important principle for building Amazon’s key-value store named Dynamo [26]. There are two points of view when looking at consistency: client-side consistency and server-side consistency. On the client side, developers care about how data updates are observed. On the server side, they care about how updates flow through the system and what guarantees the system can give with respect to updates.

The client side can be viewed as a collection of these components:

• A storage system: something of large scale and highly distributed, built to guarantee durability and availability.

• Process A: a process that writes to and reads from the storage system.

• Processes B and C: two processes independent of process A that also write to and read from the storage system. It is irrelevant whether these are really processes or threads within the same process; what is important is that they are independent and need to communicate to share information.

Client-side consistency is concerned with when and how observers (in this case the processes A, B, and C) see the updates of a data object in the storage system. The examples below illustrate the various types of client consistency after process A has made a change to a data object.

• Strong consistency. After the update completes, any subsequent access (by A, B, or C) will return the updated value.

• Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the value will be returned. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is dubbed the inconsistency window.

• Eventual consistency. This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme. The most popular system that implements eventual consistency is the Domain Name System (DNS). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update.
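The inconsistency window can be illustrated with a deliberately simplified toy model: a primary replica acknowledges writes immediately, while backups are updated by a lazy anti-entropy step. The class and method names below are illustrative assumptions, not the API of any system described in this thesis:

```python
class Replica:
    """One node's local copy of the data."""
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class EventuallyConsistentStore:
    """Toy store: W=1 writes to the primary, backups converge lazily."""
    def __init__(self, n_replicas=3):
        self.primary = Replica()
        self.backups = [Replica() for _ in range(n_replicas - 1)]
        self.pending = []  # updates not yet propagated to the backups

    def write(self, key, value):
        self.primary.data[key] = value  # acknowledged after a single replica
        self.pending.append((key, value))

    def sync(self):
        """Anti-entropy step: propagate all pending updates to every backup."""
        for key, value in self.pending:
            for replica in self.backups:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("k", "v1")
stale = store.backups[0].read("k")  # inside the inconsistency window -> None
store.sync()                        # replica convergence
fresh = store.backups[0].read("k")  # after the window -> "v1"
print(stale, fresh)  # -> None v1
```

A read served by a backup before `sync()` observes the old (here, missing) value; once no new updates arrive and the anti-entropy step runs, all replicas agree — which is exactly the eventual consistency guarantee stated above.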

There are several important variations of the eventual consistency model:

• Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value, and a write is guaranteed to supersede the earlier write. Access by process C, which has no causal relationship to process A, is subject to the normal eventual consistency rules.

• Read-your-writes consistency. This is an important model where process A, after having updated a data item, always accesses the updated value and never sees an older value. This is a special case of the causal consistency model.

• Session consistency. This is a practical version of the previous model, where a process accesses the storage system in the context of a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of a certain failure scenario, a new session must be created and the guarantees do not overlap the sessions.

• Monotonic read consistency. If a process has seen a particular value for the object, any subsequent accesses will never return any previous values.

• Monotonic write consistency. In this case, the system guarantees to serialize the writes by the same process. Systems that do not guarantee this level of consistency are notoriously difficult to program [88].

Werner Vogels described server-side consistency in [88] with the following definitions:

N = the number of replica nodes that store the data;
W = the number of replicas that need to acknowledge the receipt of an update before the update completes;
R = the number of replicas that are contacted when a data object is accessed through a read operation.

If W + R > N, then the write set and the read set always overlap and one can guarantee strong consistency. In the primary-backup RDBMS scenario, which implements synchronous replication, N=2, W=2, and R=1. No matter from which replica the client reads, it will always get a consistent answer. In the asynchronous replication case with reading from the backup enabled, N=2, W=1, and R=1. In this case R+W=N, and consistency cannot be guaranteed.

The problem with these configurations, which are basic quorum protocols [67], is that when, because of failures, the system cannot write to W nodes, the write operation has to fail, marking the unavailability of the system. With N=3 and W=3 and only two nodes available, the system will have to fail the write.

In distributed storage systems that provide high performance and high availability, the number of replicas is in general higher than two. Systems that focus solely on fault tolerance often use N=3 (with W=2 and R=2 configurations). Systems that must serve very high read loads often replicate their data beyond what is required for fault tolerance; N can be tens or even hundreds of nodes, with R configured to 1 such that a single read will return a result. Systems that are concerned with consistency are set to W=N for updates, which may decrease the probability of the write succeeding. A common configuration for systems that are concerned about fault tolerance but not consistency is to run with W=1 to get minimal durability of the update and then rely on a lazy (epidemic) technique to update the other replicas.

Weak/eventual consistency arises when W + R ≤ N, meaning that there is a possibility that the read set and the write set will not overlap.
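The quorum configurations discussed above all reduce to the single overlap condition W + R > N, which can be captured in a few lines. The function name below is an illustrative choice; the example values are the configurations from the text:

```python
def is_strongly_consistent(n, w, r):
    """Quorum overlap rule: reads are strongly consistent iff W + R > N."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must each be between 1 and N")
    return w + r > n


# Configurations from the discussion above:
print(is_strongly_consistent(2, 2, 1))    # synchronous primary-backup  -> True
print(is_strongly_consistent(2, 1, 1))    # asynchronous replication    -> False
print(is_strongly_consistent(3, 2, 2))    # common fault-tolerant setup -> True
print(is_strongly_consistent(100, 1, 1))  # heavy read replication      -> False
```

Any configuration for which the function returns False falls into the weak/eventual consistency regime, since some read quorum can miss every node that acknowledged the latest write.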
