DISTRIBUTED KEY-VALUE STORE FOR LARGE-SCALE SYSTEMS

Thanh Trung Nguyen

This thesis is submitted to Le Quy Don Technical University for the degree of Ph.D. at the Faculty of Information Technology

Research Supervisor: Assoc. Prof. Dr. Hieu Minh Nguyen

Ha Noi, 2016

Declaration

I declare that this thesis contains no material that has been accepted for the award of any other degree or diploma in any university or other institution. To the best of my knowledge and belief, this thesis contains no material previously published or written by another person, except where due reference is made in the text of the thesis.

PhD Candidate
Nguyen Trung Thanh

Abstract

In recent decades, network application systems have grown rapidly. Beyond business data processing and Online Transaction Processing (OLTP) applications, many new types of application, such as text management (e.g. search engines), data warehouses, stream processing, and scientific and intelligent databases, have been developed and are being researched. In particular, the size of many applications such as social networks, online commerce systems, and personal cloud storage has increased exponentially. In such huge applications, building a high-performance, scalable data storage system is one of the biggest challenges. To address the data storage problem efficiently, new mechanisms for building data storage have been developed to fill the gap left by traditional relational database management systems. These new data storage mechanisms are called NoSQL. The key-value store is one of the NoSQL data store schemes, and it plays an important role in many large systems.

A distributed key-value store is an extension of the key-value store that supports data distribution across multiple servers. The quality and capability of a distributed key-value store depend on many factors, including the performance of the key-value store on each node, the efficiency of the data distribution algorithms, the structure of the storage system for a
specific data type, the consistency model, the ability to store big values and big data structures, etc. There are three important questions in building high-performance distributed key-value stores: How to minimize latency and maximize throughput of the key-value store with minimum memory overhead in the persistent layer? How to store big values in a key-value store, or how to manage a large number of big files in cloud storage with the advantages of a distributed key-value store? How to efficiently store and distribute big data structures in a key-value store?

This thesis attempts to address these questions by studying the movement from RDBMS to NoSQL databases and exploring techniques for designing key-value stores. After analyzing some common approaches for transforming from RDBMS to NoSQL, we conducted several experiments to reveal the mechanisms of each approach. For one of the most popular key types (the auto-increasing integer key), a high-performance key-value store was proposed to minimize the latency of both read and write operations. This method ensures that there is at most one disk seek per operation and that the memory overhead per key is fixed. After that, we analyzed some popular existing personal cloud storage systems and found that the space complexity of file metadata in these systems is linear in the file size, or O(n). In other words, most existing key-value stores and database systems lack the ability to efficiently store big values such as big files in cloud storage. Consequently, we proposed a new architecture and algorithms for a big-file cloud that use the advantages of the key-value store to reduce the space complexity of file metadata from O(n) to O(1).

Finally, we proposed the Forest of Distributed B+Tree, based on a key-value store, for storing big data structures such as Sets and Maps. The novelty of our method is that this structure supports distributing partial values. This ability is not supported in some existing popular systems such as BigTable and Cassandra, where each row must fit on one server. Our method allows
us to build huge-row storage in which rows are larger than in Google BigTable and Cassandra. In summary, this thesis studies and proposes methods for efficiently storing data in large-scale systems.

Acknowledgements

First of all, I would like to express my gratitude to my supervisor Minh Hieu Nguyen for his guidance, experience and encouragement throughout my PhD journey. I would like to thank Uy Quang Nguyen for his support and many helpful insights for my research. I thank Associate Professor Lam Thu Bui for his meaningful questions in the early stage of my research; they helped me a lot to improve my academic point of view as a science researcher. I thank my cool office mates, Loi Van Cao, Thien Duc Nguyen, etc., for fun times and for a lot of interesting discussions. I want to give my appreciation to Dung Hong Luu from the Network Security group for his support, advice and collaboration. I also take this opportunity to thank all my colleagues in the Department of Network Security for their effort to make the department such an excellent environment to work in. I would like to specially thank Thanh Ta Minh and Ly Vu Thi for always being helpful and responsive. I thank the Research Fund RF @ K12, which supported me a lot in publishing my research results. I thank the Research and Development Department of VNG Corporation for providing big infrastructure and large real data sets for this research. I thank my friends in the Research and Development Department of VNG Corporation: Anh Nguyen Hoai, Tin Khac Vu, my little brother Trung Thanh Nguyen and Tung Chi Vu. You all have made the team such a great place to work and helped me to apply our research results in real products.

Most importantly, I wish to thank my family for their endless and unconditional love, and for their sustained support and encouragement. I am so grateful that they have always been there for me.

Contents

Declaration
Abstract
List of Figures
List of Tables
Abbreviations

Introduction
  1.1 Key-Value Store Overview
  1.2 Big Data Challenges and Motivation
  1.3 MCS: Data Storage Framework
    1.3.1 Memory Cache
    1.3.2 Key-Value Store Abstraction
    1.3.3 Service Model
    1.3.4 Commit-Log
  1.4 Problem Statements
  1.5 Contributions
  1.6 Thesis structure

Backgrounds
  2.1 Overview
    2.1.1 The development of NoSQL
    2.1.2 Scalable Data Management for Cloud Computing and Big Data
  2.2 Basic concepts
    2.2.1 ACID Properties
    2.2.2 Consistency, Availability and Partition tolerance
    2.2.3 Eventual Consistency
    2.2.4 The BASE Consistency Model
    2.2.5 Partitioning
    2.2.6 Data structures for persistent layer in key-value store
  2.3 Cloud Storage Benchmarks and Workloads

High performance key-value store for large-scale storage service
  3.1 Introduction
  3.2 Related works
  3.3 Sequential Log Storage Model
  3.4 Proposed Key-Value Store
    3.4.1 Data structure for the Index
    3.4.2 Data Layout in Persistent File
    3.4.3 Key-Value Data Table and Main Algorithms
    3.4.4 Implementation
  3.5 Analysis and Comparison with other key-value stores
  3.6 Performance Evaluation
    3.6.1 Standard Benchmark
    3.6.2 Engine Evaluation
    3.6.3 Discussion
  3.7 Summary

High-Performance Distributed Big-File Storage Based On Key-Value Store
  4.1 Introduction
  4.2 Related Works
  4.3 Proposed Method for Big File Storage System
    4.3.1 General Big File Model
    4.3.2 Architecture Overview
    4.3.3 Logical Data layout
    4.3.4 Chunk Storage
    4.3.5 Metadata
    4.3.6 Data distribution and replication
    4.3.7 Uploading and deduplication algorithm
    4.3.8 Downloading algorithm
    4.3.9 Secure Data Transfer Protocol
  4.4 Evaluation
    4.4.1 Evaluate Key-Value store for BFC
    4.4.2 Locally performance comparison
    4.4.3 Metadata comparison
    4.4.4 Deduplication
  4.5 Summary

Forest of distributed B+Tree for storing large number of big and growing sets based on key-value store
  5.1 Introduction
  5.2 Big Set Problem Statement
    5.2.1 Problem
    5.2.2 Complexity
  5.3 Related works
  5.4 Forest of Distributed B+Tree for solving Big-Set Problem
    5.4.1 Method Overview
    5.4.2 Forest of Distributed B+Tree Definition
    5.4.3 Leaf Nodes of Distributed B+Tree
    5.4.4 Non-leaf Nodes of Distributed B+Trees
    5.4.5 Forest of Distributed B+Tree
    5.4.6 General key-value store using Forest of Distributed B+Tree
  5.5 Evaluation
  5.6 Discussion
  5.7 Applications of ZDB and Forest of Distributed B+Tree
    5.7.1 Computing Architecture for Anomaly Detection System
    5.7.2 Specific Storage Solution for Specific Structured Data
  5.8 Summary

Conclusion and Future works
Publications
Bibliography