Adaptive Cache-Aware Multiprocessor Scheduling Framework
(Research Masters)

A THESIS SUBMITTED TO THE FACULTY OF SCIENCE AND TECHNOLOGY OF QUEENSLAND UNIVERSITY OF TECHNOLOGY IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF RESEARCH MASTER

Huseyin Gokseli Arslan
Faculty of Science and Technology
Queensland University of Technology
September 2011

Copyright in Relation to This Thesis
© Copyright 2011 by Huseyin Gokseli Arslan. All rights reserved.

Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.
Signature:
Date:

This thesis is dedicated to my dearest family and my beloved one for their love and endless support.

Abstract
Computer resource allocation represents a significant challenge, particularly for multiprocessor systems, which consist of shared computing resources to be allocated among co-runner processes and threads. While efficient resource allocation results in a highly efficient and stable overall multiprocessor system and good individual thread performance, ineffective resource allocation causes significant performance bottlenecks even for a system with abundant computing resources. This thesis proposes a cache-aware adaptive closed loop scheduling framework as an efficient resource allocation strategy for the highly dynamic resource management problem, which requires instant estimation of highly uncertain and unpredictable resource patterns.

Many different approaches to this highly dynamic resource allocation problem have been developed, but neither the dynamic nature nor the time-varying and uncertain characteristics of the resource allocation problem is well considered. These approaches employ either static or dynamic optimization methods, or advanced scheduling algorithms such as the Proportional Fair (PFair) scheduling algorithm. Some of the approaches that do consider the dynamic nature of multiprocessor systems apply only a basic closed loop system; hence, they fail to take the time-varying behaviour and uncertainty of the system into account. Therefore, further research into multiprocessor resource allocation is required.

Our closed loop cache-aware adaptive scheduling framework takes the resource availability and the resource usage patterns into account by measuring time-varying factors such as cache miss counts, stalls and instruction counts. More specifically, the cache usage pattern of the thread is identified using the QR recursive least square (RLS) algorithm and cache miss count time-series statistics. For the identified cache resource dynamics, our closed loop cache-aware adaptive scheduling framework enforces instruction fairness for the threads. Fairness, in the context of our research project, is defined as resource allocation equity, which reduces co-runner thread dependence in a shared resource environment. In this way, instruction count degradation due to shared cache resource conflicts is overcome.
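For orientation, one plausible time-series regression form consistent with this description is sketched below; the symbols are illustrative only, and the actual cache miss model developed in the body of the thesis may differ in structure and order:

\[
\hat{m}(k) = \sum_{i=1}^{p} a_i\, m(k-i) + \sum_{j=1}^{q} b_j\, m_{\mathrm{co}}(k-j),
\]

where \(m(k)\) and \(m_{\mathrm{co}}(k)\) denote the thread and co-runner cache miss counts in sampling interval \(k\), and the coefficients \(a_i, b_j\) are re-estimated online by the QR-RLS algorithm as new counter samples arrive.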
In this respect, our closed loop cache-aware adaptive scheduling framework contributes to the research field in two major and three minor aspects. The two major contributions lead to the cache-aware scheduling system. The first major contribution is the development of the execution fairness algorithm, which mitigates the co-runner cache impact on thread performance. The second major contribution is the development of the relevant mathematical models, such as the thread execution pattern and cache access pattern models, which formulate the execution fairness algorithm in terms of mathematical quantities. Following the development of the cache-aware scheduling system, our adaptive self-tuning control framework is constructed to add an adaptive closed loop aspect to the cache-aware scheduling system. This control framework consists of two main components: the parameter estimator and the controller design module. The first minor contribution is the development of the parameter estimator; the QR Recursive Least Square (RLS) algorithm is applied in our closed loop cache-aware adaptive scheduling framework to estimate the highly uncertain and time-varying cache resource patterns of threads. The second minor contribution is the design of the controller design module; the algebraic controller design algorithm, pole placement, is utilized to design a controller able to provide the desired time-varying control action. The adaptive self-tuning control framework and the cache-aware scheduling system together constitute our final framework, the closed loop cache-aware adaptive scheduling framework. The third minor contribution is the validation of this framework's efficiency in overcoming the co-runner cache dependency. Time-series statistical counters are developed for the M-Sim multi-core simulator, and the theoretical findings and mathematical formulations are implemented as MATLAB m-file code. In this way, the overall framework is tested and the experiment outcomes are analyzed. According to our experiment outcomes, it is concluded that our closed loop cache-aware adaptive scheduling framework successfully drives the co-runner cache dependent thread instruction count to the co-runner independent instruction count with an error margin of up to 25% when the cache is highly utilized. In addition, the thread cache access pattern is estimated with 75% accuracy.

Chapter 6: Experimental Setup and Simulation (excerpt)
Figure 6.18: Adaptive Miss Count and Actual Miss Count

Apart from its tracking capability, the framework is also able to consider the cache pattern of the thread during this process, providing proactive rather than reactive control. For a calculated tracking error, the controller computes the number of processor cycles (the controller output) required to drive this error to zero; this process also takes the cache pattern of the target thread into account. The conversion of the instruction count tracking error into processor cycles is necessary because the processor is only capable of allocating static resources, rather than dynamic ones such as instruction counts, which depend on each thread's cache characteristics. Here, our adaptive scheduling controller performs this mapping based on the estimated cache pattern of the target thread. The controller output (processor cycles) is then allocated by the processor at the controller's request.
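To make this loop concrete, the Python sketch below shows one scheduling interval: an RLS update of the cache-pattern coefficients, followed by conversion of the instruction-count tracking error into a processor-cycle request. All function names, signal shapes and the fixed cycles_per_ic gain are hypothetical illustrations; the thesis's own implementation uses MATLAB m-files together with M-Sim counters and derives the error-to-cycle mapping from the estimated cache pattern rather than a constant gain.

```python
# Illustrative sketch of one closed-loop scheduling interval (hypothetical
# names and shapes; not the thesis's MATLAB/M-Sim implementation).
import numpy as np

def rls_update(theta, P, phi, y, lam=0.98):
    """Standard exponentially weighted RLS update of regression coefficients."""
    phi = phi.reshape(-1, 1)
    k = P @ phi / (lam + phi.T @ P @ phi)        # gain vector
    err = y - (phi.T @ theta).item()             # a priori prediction error
    theta = theta + k * err
    P = (P - k @ phi.T @ P) / lam
    return theta, P

def control_step(theta, P, miss_hist, co_miss_hist, miss_now,
                 ic_ref, ic_actual, cycles_per_ic):
    """Estimate the cache-miss pattern, then map the instruction-count
    tracking error to an extra processor-cycle allocation."""
    phi = np.concatenate([miss_hist, co_miss_hist])   # regressor of past miss counts
    theta, P = rls_update(theta, P, phi, miss_now)    # identify cache pattern online
    error = ic_ref - ic_actual                        # instruction-count tracking error
    # In the full framework this gain would be derived from the estimated
    # cache pattern (theta); a constant is used here purely for illustration.
    extra_cycles = max(0.0, error * cycles_per_ic)    # static resource the scheduler can grant
    return theta, P, extra_cycles

# Toy usage with synthetic numbers
n = 4
theta = np.zeros((n, 1))
P = 1e3 * np.eye(n)
theta, P, cycles = control_step(theta, P,
                                miss_hist=np.array([120.0, 130.0]),
                                co_miss_hist=np.array([300.0, 310.0]),
                                miss_now=128.0,
                                ic_ref=5.0e6, ic_actual=4.2e6,
                                cycles_per_ic=0.8)
print(f"extra cycles requested: {cycles:.0f}")
```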
Considering this framework, a series of experiments is conducted using the experiment design described in Section 6.1. These experiments indicate three different cases, depending on the co-runner thread pair. For heavy-weight co-runner thread pairs, our adaptive scheduling is successful in improving the instruction count of the target thread towards the reference instruction count. For a light-weight co-runner thread pair, our adaptive scheduling performance depends on how close the actual instruction count already is to the reference instruction count. In these experiments, the reference instruction count is selected as the target thread's instruction count in a dedicated cache environment. In this case, our cache-aware adaptive closed loop scheduling framework can be considered a passive component in the processing platform. For a highly nonlinear and unstable target thread such as mesa, our scheduling framework may exhibit an error margin of up to 25 percent. This is entirely dependent on the co-runner pair's instruction count statistics: depending on the persistency and consistency of the thread statistics, the error margin can vary from a few percent up to 25 or even 50 percent. Nevertheless, our experiments show that the cache-aware adaptive closed loop scheduling framework performs reasonably well for most threads.

As mentioned above, one of the significant contributions of this framework is estimating and considering thread cache patterns, which improves the robustness and practicability of the framework. According to our experiment results, despite a small number of spikes and jumps, cache pattern estimation is an effective component of the overall framework. All in all, the experiment outcomes indicate that the cache-aware adaptive closed loop scheduling framework is a reasonable solution: it considers the cache patterns of the threads and compensates for the performance degradation caused by inefficient cache allocation with an additional allocation of processor cycles.
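As a side note on how such error margins can be quantified, the short sketch below computes the relative deviation of a co-runner-affected instruction count trace from its dedicated-cache reference; the array names and numbers are synthetic and purely illustrative.

```python
# Reading an error margin (e.g. the ~25% figure above) off two traces.
import numpy as np

ic_dedicated = np.array([5.1e6, 5.0e6, 5.2e6, 5.1e6])   # reference: dedicated-cache run
ic_shared    = np.array([4.0e6, 4.6e6, 5.0e6, 3.9e6])   # same thread with a co-runner

rel_error = np.abs(ic_shared - ic_dedicated) / ic_dedicated
print(f"worst-case error margin: {100 * rel_error.max():.1f}%")
print(f"mean error margin:       {100 * rel_error.mean():.1f}%")
```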
Chapter 7
Conclusions and Recommendations

7.1 Summary
This thesis has successfully constructed a cache-aware adaptive closed loop scheduling framework, which merges multidisciplinary research areas including modern control theory and computer systems. In line with the objectives and contributions stated in the introduction, two major and three minor achievements can be underscored:
• First of all, the execution fairness algorithm has been successfully applied within the cache-aware adaptive scheduling framework. As stated in the introduction, this algorithm makes the actual thread's instruction count converge to the cache-dedicated (reference) instruction count. In this way, the co-runner dependency of thread performance has been eliminated.
• A thread execution model has been successfully developed. This state-space model provides a time-series formulation of the coupled instruction count and cache resource dynamics of the thread and its co-runner threads. These sets of equations describe the thread execution and cache access patterns.
• Based on the developed thread execution model, the QR Recursive Least Square (RLS) parameter estimation algorithm has been successfully applied to estimate the instantaneous cache pattern of the thread. To achieve this, a regression cache miss model has been developed using the time-series statistics of the miss count and co-runner miss count metrics. The QR-RLS algorithm provides a set of polynomial coefficients representing the cache behaviour of the thread at that particular time for the given statistics.
• Using the estimated cache access and execution patterns of the thread, an algebraic controller design algorithm (pole placement) has been successfully applied to the adaptive self-tuning control framework, which enforces the actual instruction count to track the cache-dedicated (reference) instruction count irrespective of the cache patterns of the co-runner and actual threads. In the pole placement algebraic design algorithm, the designer has flexibility in determining the closed loop system response; in our case, a stable cache-aware adaptive scheduling framework response has been successfully designed (a minimal pole placement illustration is sketched after this summary).
• The first four achievements can be considered the theoretical foundation of the adaptive self-tuning control framework design. The last achievement is the development of the patch software module, which adds a few functional modules and time-series hit/miss counters to the M-Sim system simulator.
Following the theoretical foundation of the cache-aware adaptive closed loop scheduling framework and the development of the statistics retrieval tools, a number of experiments have been conducted and the results analyzed using SPEC CPU2000 benchmark threads. Based on these observations, it is concluded that the framework is particularly efficient for co-runner threads with high cache requirements, and relatively inefficient with co-runners with low cache demands.
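The sketch below illustrates the pole placement idea for a generic first-order discrete-time model x(k+1) = a·x(k) + b·u(k) with state feedback u(k) = −K·x(k), so that the closed loop becomes x(k+1) = (a − bK)·x(k). The plant parameters and the desired pole are arbitrary example values, not the identified thread-execution dynamics from this thesis.

```python
# Minimal algebraic pole placement for a first-order discrete-time model.
# The values of a, b and the desired pole are hypothetical examples.

def place_first_order(a: float, b: float, desired_pole: float) -> float:
    """Return the state-feedback gain K so that (a - b*K) == desired_pole."""
    if b == 0.0:
        raise ValueError("plant is uncontrollable: b must be nonzero")
    return (a - desired_pole) / b

a, b = 0.95, 0.5          # hypothetical identified parameters
p = 0.6                   # desired (stable) closed-loop pole, |p| < 1
K = place_first_order(a, b, p)

# Simulate the closed loop x(k+1) = (a - b*K) * x(k) from an initial error of 1.0
x = 1.0
for k in range(5):
    u = -K * x
    x = a * x + b * u
    print(f"k={k + 1}  x={x:.4f}")
```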
7.2 Future Work and Recommendations
Some potential research areas can be identified on the basis of the literature and our research outcomes.

7.2.1 Heterogeneous Multiprocessor Architecture Resource Allocation Problem
In contrast to a homogeneous multiprocessor architecture, a heterogeneous multiprocessor architecture has specific run-time fault handling, power management and computational resources for each core. For instance, while one core may be a low-power microprocessor with limited cache and processor resources, another may be a powerful graphics processing core with large memory and processor resources. In such an architecture, two important research questions arise:
• How can threads be allocated among these heterogeneous cores such that maximum resource efficiency is achieved?
• What kinds of strategies can be developed to estimate thread resource requirements at run time?
Because the specialization and corresponding resources of each core are significantly diversified, it is critical to allocate each thread to a suitable core; otherwise, a significant degradation in thread and core performance is inevitable. There are two main approaches to the core/resource allocation problem: static and dynamic. Hence, an efficient cache and resource management framework is essential. In other words, resource and role management on dynamic heterogeneous multicore processors (DHMPs) is a potential field for future research work.

7.2.2 Statistical Pre-processing of Real-Time Statistical Information
Another challenge in any adaptive or dynamic framework is the estimation of the system state based on past measurements or states. Although statistical and deterministic estimation methods offer straightforward approaches, there can be a significant number of measurement samples with highly nonlinear and unpredictable patterns. This generally has a negative impact on the efficiency of the estimation process and, indirectly, on the overall system performance. Hence, pre-processing the statistics before forming a conclusion about the system state is necessary to remove incoherent measurement samples. This elimination process supports system stability and accuracy. In this field of research, methods range from simple filters to highly complicated data-mining algorithms.
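As a concrete, though simplistic, instance of such pre-processing, the sketch below rejects hardware-counter samples that deviate from a rolling median by more than a few median absolute deviations. The window length and threshold are arbitrary illustrative choices and are not taken from this thesis.

```python
# Simple outlier rejection for noisy counter samples: drop samples far from
# a rolling median (window and threshold are illustrative values only).
import numpy as np

def reject_outliers(samples, window=8, n_mads=3.0):
    samples = np.asarray(samples, dtype=float)
    keep = np.ones(samples.size, dtype=bool)
    for i in range(samples.size):
        lo = max(0, i - window)
        ref = samples[lo:i + 1]                      # recent history including this sample
        med = np.median(ref)
        mad = np.median(np.abs(ref - med)) or 1.0    # avoid a zero scale
        if abs(samples[i] - med) > n_mads * mad:
            keep[i] = False                          # incoherent sample: discard
    return samples[keep]

raw = [100, 102, 98, 101, 950, 99, 103, 97, 5, 100]   # two incoherent spikes
print(reject_outliers(raw))
```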
7.2.3 Robust Adaptive Control Theory
Another potential research field relevant to our project is robust control theory, which can be applied to increase the robustness of the system by defining error margins (bounds). This analysis can be used to reduce the system control and estimation cost, especially for adaptive self-learning systems: within a given error range, the system is considered to have a fixed response even if there are significant oscillations within that range. As a result, the adaptive self-learning cost can be reduced significantly. However, the derivation of these bounds requires substantial mathematical and system analysis. Hence, the computational cost of our cache-aware adaptive closed loop scheduling framework could be decreased by using these mathematical tools (a minimal dead-band sketch is given at the end of this chapter).

7.2.4 Theoretical Analysis of the Scheduling Framework
As stated in previous chapters, the algorithmic complexities of the QR-RLS and pole placement controller design algorithms are Θ(n²) and Θ(n³), respectively. Based on this, the overall framework's computational complexity is max(Θ(n³), Θ(n²)) = Θ(n³), where n is the system order, i.e. the number of past measurement samples used in deriving the current system states. In this thesis, the theoretical analysis of the overall framework and of the individual algorithms is not comprehensive, owing to the limited scope and time frame of our Masters research project. A potential future work is a comprehensive theoretical analysis of the proposed algorithms, algebraic pole placement and QR Recursive Least Square (RLS), covering the complexity and stability of the algorithms as well as of the overall scheduling framework.
QR-RLS Algorithm. The Fast QR-RLS algorithm is a potential alternative, which significantly improves the complexity of the online learning (parameter identification) module from Θ(n²) to Θ(n).
Pole Placement Algorithm. The pole placement algorithm has a higher computational complexity than the QR-RLS algorithm; hence, it can be considered the computational bottleneck of the overall framework. Further theoretical analysis and different computational approaches are necessary to reduce the complexity of the pole placement algorithm.
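The dead-band idea mentioned in Section 7.2.3 can be illustrated with a few lines of code: while the tracking error stays inside a fixed bound the response is treated as acceptable and no re-tuning is triggered. The bound value below is hypothetical, not a figure from this thesis.

```python
# Minimal dead-band check for a robust adaptive loop: skip re-tuning while
# the tracking error stays inside a fixed bound (bound value is illustrative).

ERROR_BOUND = 0.05   # hypothetical 5% band around the reference

def needs_retuning(reference: float, actual: float, bound: float = ERROR_BOUND) -> bool:
    """Treat any response within +/- bound of the reference as 'fixed';
    only errors outside the band trigger estimator/controller updates."""
    relative_error = abs(reference - actual) / abs(reference)
    return relative_error > bound

print(needs_retuning(5.0e6, 4.9e6))   # inside the band  -> no re-tuning
print(needs_retuning(5.0e6, 3.5e6))   # outside the band -> re-tune
```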
Keywords
Multiprocessor Scheduling, Adaptive Control Theory, Recursive Least Square, Cache-Aware Adaptive Scheduling Framework

Acknowledgments
I gratefully