Real Interest Rate Linkages: Testing for Common Trends and Cycles

Darren Pain* and Ryland Thomas*

* Bank of England, Threadneedle Street, London, EC2R 8AH. The views expressed are those of the authors and do not necessarily reflect those of the Bank of England. We would like to thank Clive Briault, Andy Haldane, Paul Fisher, Nigel Jenkinson, Mervyn King and Danny Quah for helpful comments and Martin Cleaves for excellent research assistance.

Issued by the Bank of England, London, EC2R 8AH, to which requests for individual copies should be addressed; envelopes should be marked for the attention of the Publications Group (Telephone: 0171-601 4030).

© Bank of England 1997
ISSN 1368-5562

Contents
Abstract
Introduction
I Common trends and cycles - econometric theory and method
II Empirical results
III European short rates
IV Long-term real interest rates in the G3
V Conclusion
References

Abstract

This paper formed part of the Bank of England's contribution to a study by the G10 Deputies on saving, investment and real interest rates, see Jenkinson (1996). It investigates the existence of common trends and common cycles in the movements of industrial countries' real interest rates. Real interest rate movements are decomposed into a trend (random walk) element and a cyclical (stationary moving average) element using the Beveridge-Nelson decomposition. We then derive a common trends and cycles representation using the familiar theory of cointegration and the more recent theory of cofeatures developed by Vahid and Engle (1993).

We consider linkages between European short-term real interest rates. Here there is evidence of German leadership/dominance - we cannot reject the hypothesis that the German real interest rate is the single common trend and that the two common cycles are represented by the spreads of French and UK rates over German rates. The single common trend remains when the United States (as representative of overseas rates) is added to the system, but German leadership is rejected in favour of US (overseas) leadership. We also find the existence of a single common trend in G3 rates after 1980.

Introduction

Real interest rates lie at the heart of the transmission mechanism of monetary policy. Increasingly, attention has been paid to how different countries' real interest rates interact and how this interaction has developed through time. Economic theory would suggest that in a world where capital is perfectly mobile and real exchange rates converge to their equilibrium levels, ex-ante real interest rates (ie interest rates less the expected rate of inflation across the maturity of the asset) should move together in the long run.(1) The extent to which they move together in practice may therefore shed some light on either the degree of capital mobility or real exchange rate convergence, see Haldane and Pradhan (1992). For instance, the increasing liberalisation of domestic capital markets during the 1980s would be expected to have strengthened the link among different countries' real interest rates in this period.

The aim of this paper is to investigate statistically the degree to which real interest rates have moved together both in the long run and over the cycle. Specifically, we test for the existence of common 'trends' and 'cycles' in real interest rates for particular groups of countries, using familiar cointegration analysis and the more recent common feature techniques developed by Vahid and Engle (1993).
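For concreteness, the quantities the abstract and introduction refer to can be written out as follows. These are the standard textbook definitions in our own notation; they are not equations reproduced from the paper.

% Ex-ante real interest rate over the maturity k of the asset:
\[
  r_t = i_t - E_t\,\pi_{t+k}.
\]
% Beveridge-Nelson decomposition of an I(1) series y_t with Wold
% representation \Delta y_t = \mu + C(L)\varepsilon_t:
\[
  y_t = \tau_t + c_t, \qquad
  \tau_t = \tau_{t-1} + \mu + C(1)\,\varepsilon_t, \qquad
  c_t = C^{*}(L)\,\varepsilon_t, \qquad
  C^{*}(L) = \frac{C(L) - C(1)}{1 - L},
\]
% so \tau_t is the random-walk trend and c_t the stationary moving-average
% cycle. Cointegrating vectors \beta satisfy \beta' C(1) = 0 and so remove the
% common trends; cofeature vectors \tilde{\alpha} in the sense of Vahid and
% Engle satisfy \tilde{\alpha}' C^{*}(L) = 0 and so remove the common cycles.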
Hypothesis Testing for Two Means and Two Proportions
By: OpenStaxCollege

Class Time:
Names:

Student Learning Outcomes
• The student will select the appropriate distributions to use in each case.
• The student will conduct hypothesis tests and interpret the results.

Supplies:
• the business section from two consecutive days' newspapers
• three small packages of M&Ms®
• five small packages of Reese's Pieces®

Increasing Stocks Survey
Look at yesterday's newspaper business section. Conduct a hypothesis test to determine if the proportion of New York Stock Exchange (NYSE) stocks that increased is greater than the proportion of NASDAQ stocks that increased. As randomly as possible, choose 40 NYSE stocks and 32 NASDAQ stocks and complete the following statements.
H0: _
Ha: _
In words, define the random variable.
The distribution to use for the test is _.
Calculate the test statistic using your data.
Draw a graph and label it appropriately. Shade the actual level of significance.
Graph:
Calculate the p-value.
Do you reject or not reject the null hypothesis? Why?
Write a clear conclusion using a complete sentence.

Decreasing Stocks Survey
Randomly pick eight stocks from the newspaper. Using two consecutive days' business sections, test whether the stocks went down, on average, for the second day.
H0:
Ha:
In words, define the random variable.
The distribution to use for the test is _.
Calculate the test statistic using your data.
Draw a graph and label it appropriately. Shade the actual level of significance.
Graph:
Calculate the p-value:
Do you reject or not reject the null hypothesis? Why?
Write a clear conclusion using a complete sentence.

Candy Survey
Buy three small packages of M&Ms and five small packages of Reese's Pieces (same net weight as the M&Ms). Test whether or not the mean number of candy pieces per package is the same for the two brands.
H0:
Ha:
In words, define the random variable.
What distribution should be used for this test?
Calculate the test statistic using your data.
Draw a graph and label it appropriately. Shade the actual level of significance.
Graph:
Calculate the p-value.
Do you reject or not reject the null hypothesis? Why?
Write a clear conclusion using a complete sentence.

Shoe Survey
Test whether women have, on average, more pairs of shoes than men. Include all forms of sneakers, shoes, sandals, and boots. Use your class as the sample.
H0:
Ha:
In words, define the random variable.
The distribution to use for the test is _.
Calculate the test statistic using your data.
Draw a graph and label it appropriately. Shade the actual level of significance.
Graph:
Calculate the p-value.
Do you reject or not reject the null hypothesis? Why?
Write a clear conclusion using a complete sentence.
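The lab asks you to carry out these tests with your own data; purely as an illustration, the sketch below runs the same kinds of tests in Python with made-up counts. The scipy library and every number here are our own assumptions, not part of the lab; replace the hypothetical data with the values you collect.

# Illustrative only: hypothesis tests of the kind used in this lab, with
# made-up data. Replace the hypothetical counts with the values you collect.
import math
from scipy import stats

# Increasing Stocks Survey: right-tailed two-proportion z-test
# H0: p_NYSE <= p_NASDAQ    Ha: p_NYSE > p_NASDAQ
x_nyse, n_nyse = 24, 40        # hypothetical: 24 of 40 NYSE stocks increased
x_nasdaq, n_nasdaq = 15, 32    # hypothetical: 15 of 32 NASDAQ stocks increased
p1, p2 = x_nyse / n_nyse, x_nasdaq / n_nasdaq
p_pool = (x_nyse + x_nasdaq) / (n_nyse + n_nasdaq)   # pooled proportion under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_nyse + 1 / n_nasdaq))
z = (p1 - p2) / se
p_value = 1 - stats.norm.cdf(z)                      # right-tailed p-value
print(f"two-proportion test: z = {z:.3f}, p-value = {p_value:.3f}")

# Candy Survey: two-sample t-test for means (Welch, unequal variances)
# H0: mu_MM = mu_Reese      Ha: mu_MM != mu_Reese
mm_counts = [55, 58, 56]               # hypothetical pieces per M&Ms package
reese_counts = [63, 61, 64, 62, 60]    # hypothetical pieces per Reese's package
t_stat, p_val = stats.ttest_ind(mm_counts, reese_counts, equal_var=False)
print(f"two-mean test: t = {t_stat:.3f}, p-value = {p_val:.3f}")

The Decreasing Stocks Survey is a matched-pairs test on the day-to-day differences (scipy.stats.ttest_rel on the two days' prices), and the Shoe Survey is another two-sample comparison of means.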
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: shortpapers, pages 176-181, Portland, Oregon, June 19-24, 2011. © 2011 Association for Computational Linguistics

Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
Jonathan H. Clark, Chris Dyer, Alon Lavie, Noah A. Smith
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
{jhclark,cdyer,alavie,nasmith}@cs.cmu.edu

Abstract

In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, he runs an experiment to evaluate the behavior of the two systems on held-out data. In this paper, we consider how to make such experiments more statistically reliable. We provide a systematic analysis of the effects of optimizer instability - an extraneous variable that is seldom controlled for - on experimental outcomes, and make recommendations for reporting results more accurately.

1 Introduction

The need for statistical hypothesis testing for machine translation (MT) has been acknowledged since at least Och (2003). In that work, the proposed method was based on bootstrap resampling and was designed to improve the statistical reliability of results by controlling for randomness across test sets. However, there is no consistently used strategy that controls for the effects of unstable estimates of model parameters.(1) While the existence of optimizer instability is an acknowledged problem, it is only infrequently discussed in relation to the reliability of experimental results, and, to our knowledge, there has yet to be a systematic study of its effects on hypothesis testing. In this paper, we present a series of experiments demonstrating that optimizer instability can account for a substantial amount of variation in translation quality,(2) which, if not controlled for, could lead to incorrect conclusions. We then show that it is possible to control for this variable with a high degree of confidence with only a few replications of the experiment, and conclude by suggesting new best practices for significance testing for machine translation.

(1) We hypothesize that the convention of "trusting" BLEU score improvements of, e.g., > 1, is not merely due to an appreciation of what qualitative difference a particular quantitative improvement will have, but also an implicit awareness that current methodology leads to results that are not consistently reproducible.
(2) This variation directly affects the output translations, and so it will propagate to both automated metrics as well as human evaluators.

2 Nondeterminism and Other Optimization Pitfalls

Statistical machine translation systems consist of a model whose parameters are estimated to maximize some objective function on a set of development data. Because the standard objectives (e.g., 1-best BLEU, expected BLEU, marginal likelihood) are not convex, only approximate solutions to the optimization problem are available, and the parameters learned are typically only locally optimal and may strongly depend on parameter initialization and search hyperparameters. Additionally, stochastic optimization and search techniques, such as minimum error rate training (Och, 2003) and Markov chain Monte Carlo methods (Arun et al., 2010),(3) constitute a second, more obvious source of noise in the optimization procedure. This variation in the parameter vector affects the quality of the model measured on both development [...]

(3) Online subgradient techniques such as MIRA (Crammer et al., 2006; Chiang et al., 2008) have an implicit stochastic component as well based on the order of [...]
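To illustrate the kind of reporting this argues for (a generic sketch with invented scores, not the authors' code or their exact procedure), one can summarise each system by the mean and standard deviation of its metric score across several independent optimizer replications, and compare systems with a simple permutation test at the replication level:

# Generic sketch, not the paper's method: summarise BLEU variation across
# independent optimizer replications and compare two systems. All scores are
# hypothetical.
import random
import statistics

baseline_bleu = [26.1, 26.4, 25.8, 26.3, 26.0]   # hypothetical: 5 optimizer replications
improved_bleu = [26.7, 26.9, 26.5, 27.1, 26.6]   # hypothetical: 5 optimizer replications

def summarise(name, scores):
    # report the across-replication mean and standard deviation, not a single run
    print(f"{name}: mean={statistics.mean(scores):.2f} "
          f"sd={statistics.stdev(scores):.2f} over {len(scores)} replications")

def permutation_pvalue(a, b, trials=100_000, seed=0):
    # two-sided permutation test on the difference of replication means
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

summarise("baseline", baseline_bleu)
summarise("improved", improved_bleu)
print(f"permutation test p-value: {permutation_pvalue(baseline_bleu, improved_bleu):.4f}")

With only five replications per system such a test has limited resolution; the point is simply that a single-run score difference conflates the effect of the innovation with optimizer noise.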
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: shortpapers, pages 182-187, Portland, Oregon, June 19-24, 2011. © 2011 Association for Computational Linguistics

Bayesian Word Alignment for Statistical Machine Translation
Coşkun Mermer (1,2), Murat Saraçlar (2)
(1) BILGEM, TUBITAK, Gebze 41470 Kocaeli, Turkey, coskun@uekae.tubitak.gov.tr
(2) Electrical and Electronics Eng. Dept., Bogazici University, Bebek 34342 Istanbul, Turkey, murat.saraclar@boun.edu.tr

Abstract

In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models, and at the same time induces a much smaller dictionary of bilingual word pairs.

1 Introduction

Word alignment is a crucial early step in the training of most statistical machine translation (SMT) systems, in which the estimated alignments are used for constraining the set of candidates in phrase/grammar extraction (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006). State-of-the-art word alignment models, such as the IBM Models (Brown et al., 1993), the HMM (Vogel et al., 1996), and the jointly-trained symmetric HMM (Liang et al., 2006), contain a large number of parameters (e.g., word translation probabilities) that need to be estimated in addition to the desired hidden alignment variables.

The most common method of inference in such models is expectation-maximization (EM) (Dempster et al., 1977) or an approximation to EM when exact EM is intractable. However, being a maximization (e.g., maximum likelihood (ML) or maximum a posteriori (MAP)) technique, EM is generally prone to local optima and overfitting. In essence, the alignment distribution obtained via EM takes into account only the most likely point in the parameter space, but does not consider contributions from other points. Problems with the standard EM estimation of IBM Model 1 were pointed out by Moore (2004), and a number of heuristic changes to the estimation procedure, such as smoothing the parameter estimates, were shown to reduce the alignment error rate, but the effects on translation performance were not reported. Zhao and Xing (2006) note that the parameter estimation (for which they use variational EM) suffers from data sparsity and use symmetric Dirichlet priors, but they find the MAP solution.

Bayesian inference, the approach in this paper, has recently been applied to several unsupervised learning problems in NLP (Goldwater and Griffiths, 2007; Johnson et al., 2007) as well as to other tasks in SMT such as synchronous grammar induction (Blunsom et al., 2009) and learning phrase alignments directly (DeNero et al., 2008). The word alignment learning problem was addressed jointly with segmentation learning in Xu et al. (2008), Nguyen et al. (2010), and Chung and Gildea (2009).
The former two works place nonparametric priors (also known as cache models) on the parameters and utilize Gibbs sampling. However, alignment inference in neither of these works is exactly Bayesian, since the alignments are updated by running GIZA++ (Xu et al., 2008) or by local maximization (Nguyen et al., 2010). On the other hand, Chung and Gildea (2009) apply a sparse Dirichlet prior on the multinomial parameters to prevent overfitting. They use variational Bayes for inference, but they do not investigate the effect of Bayesian inference on word alignment in isolation. Recently, Zhao and Gildea (2010) proposed fertility extensions to IBM [...]
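To make the idea of integrating out the translation parameters concrete, here is a toy sketch, our own illustration rather than the authors' implementation, of collapsed Gibbs sampling for a Model 1-style alignment model with a symmetric Dirichlet prior on each source word's translation distribution. The corpus, NULL token handling, and hyperparameter value are invented for the example.

# Toy sketch (not the authors' code): collapsed Gibbs sampling of alignments
# in an IBM Model 1-style model with a symmetric Dirichlet prior (THETA) on
# each source word's translation distribution.
import random
from collections import defaultdict

random.seed(0)        # reproducible toy run
THETA = 0.01          # symmetric Dirichlet hyperparameter (assumed value)
NULL = "<null>"       # NULL source token, as in Model 1

corpus = [            # toy parallel corpus: (source tokens, target tokens)
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
target_vocab = {e for _, tgt in corpus for e in tgt}
V_E = len(target_vocab)

# link[(f, e)] counts current f->e alignment links; total[f] sums them per f
link = defaultdict(int)
total = defaultdict(int)

# initialise alignments uniformly at random and fill the count tables
alignments = []
for src, tgt in corpus:
    src_null = [NULL] + src
    a = [random.randrange(len(src_null)) for _ in tgt]
    alignments.append(a)
    for j, e in enumerate(tgt):
        link[(src_null[a[j]], e)] += 1
        total[src_null[a[j]]] += 1

def gibbs_sweep():
    # resample every alignment link conditioned on all the others, with the
    # translation table integrated out (collapsed):
    #   P(a_j = i | rest) is proportional to (n(f_i, e_j) + THETA) / (n(f_i) + THETA * V_E)
    for (src, tgt), a in zip(corpus, alignments):
        src_null = [NULL] + src
        for j, e in enumerate(tgt):
            f_old = src_null[a[j]]
            link[(f_old, e)] -= 1            # take the current link out of the counts
            total[f_old] -= 1
            weights = [(link[(f, e)] + THETA) / (total[f] + THETA * V_E)
                       for f in src_null]
            a[j] = random.choices(range(len(src_null)), weights=weights)[0]
            f_new = src_null[a[j]]
            link[(f_new, e)] += 1            # put the resampled link back
            total[f_new] += 1

for _ in range(200):                         # short toy run; real data needs more sweeps
    gibbs_sweep()

for (src, tgt), a in zip(corpus, alignments):
    src_null = [NULL] + src
    print([(e, src_null[i]) for e, i in zip(tgt, a)])

After a burn-in period, alignment samples can be aggregated across sweeps (for example, keeping the most frequent link per target word) rather than reading off a single state, which is closer in spirit to averaging over the posterior.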
Journal of NeuroEngineering and Rehabilitation
Methodology - Open Access

Hypothesis testing for evaluating a multimodal pattern recognition framework applied to speaker detection
Patricia Besson* and Murat Kunt
Address: Signal Processing Institute (ITS), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
Email: Patricia Besson* - patricia.besson@univmed.fr; Murat Kunt - murat.kunt@epfl.ch
* Corresponding author

Published: 27 March 2008. Journal of NeuroEngineering and Rehabilitation 2008, 5:11, doi:10.1186/1743-0003-5-11. Received: 7 February 2007; Accepted: 27 March 2008. This article is available from: http://www.jneuroengrehab.com/content/5/1/11. © 2008 Besson and Kunt; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0).

Abstract

Background: Speaker detection is an important component of many human-computer interaction applications, such as multimedia indexing or ambient intelligence systems. This work addresses the problem of detecting the current speaker in audio-visual sequences. The detector needs only simple equipment, since a single camera and a microphone meet its requirements.

Method: A multimodal pattern recognition framework is proposed, with solutions provided for each step of the process, namely the feature generation and extraction steps, the classification, and the evaluation of the system performance. The decision is based on an estimate of the synchrony between the audio and the video signals. Prior to classification, an information-theoretic framework is applied to extract optimized audio features using video information. The classification step is then defined through a hypothesis testing framework in order to obtain confidence levels associated with the classifier outputs, thereby allowing an evaluation of the performance of the whole multimodal pattern recognition system.

Results: Through the hypothesis testing approach, the classifier performance can be given as a ratio of detection to false-alarm probabilities. Above all, the hypothesis tests provide a means of measuring the efficiency of the whole pattern recognition process. In particular, the gain offered by the proposed feature extraction step can be evaluated. As a result, it is shown that introducing such a feature extraction step increases the ability of the classifier to produce good relative instance scores, and therefore improves the performance of the pattern recognition process.

Conclusion: The powerful capacities of hypothesis tests as an evaluation tool are exploited to assess the performance of a multimodal pattern recognition process. In particular, the advantage of performing or not performing a feature extraction step prior to classification is evaluated. Although the proposed framework is used here for detecting the speaker in audio-visual sequences, it could be applied to any other classification task involving two spatio-temporally co-occurring signals.

Background

Speaker detection is an important component of many human-computer interaction applications, such as multimedia indexing or ambient intelligence systems (through the use of speech-based user interfaces). Recent, reliable speech recognition methods indeed rely on both acoustic and visual cues to perform [1]. They therefore require the speaker to be identified and discriminated [...]
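The abstract frames classifier evaluation in hypothesis-testing terms, reporting performance as a detection probability against a false-alarm probability. The short sketch below, a generic illustration with invented scores rather than the paper's code, shows what such an operating point looks like for a simple score-thresholding detector:

# Generic illustration (not the paper's implementation): for scores produced
# under H1 (speaker present) and H0 (speaker absent), a decision threshold
# gives a detection probability P_D and a false-alarm probability P_FA.
def operating_point(h1_scores, h0_scores, threshold):
    # decide "speaker present" whenever the score reaches the threshold
    p_d = sum(s >= threshold for s in h1_scores) / len(h1_scores)
    p_fa = sum(s >= threshold for s in h0_scores) / len(h0_scores)
    return p_d, p_fa

h1_scores = [0.91, 0.74, 0.88, 0.65, 0.95, 0.81]   # hypothetical scores, speaker present
h0_scores = [0.12, 0.44, 0.30, 0.57, 0.22, 0.35]   # hypothetical scores, speaker absent

for thr in (0.4, 0.6, 0.8):
    p_d, p_fa = operating_point(h1_scores, h0_scores, thr)
    print(f"threshold={thr:.1f}  P_D={p_d:.2f}  P_FA={p_fa:.2f}")

Sweeping the threshold traces out an ROC curve; comparing such detection/false-alarm trade-offs is the kind of evaluation the abstract describes for judging the benefit of the feature extraction step.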
Implementation Science
Study Protocol - Open Access

The efficacy of computer reminders on external quality assessment for point-of-care testing in Danish general practice: rationale and methodology for two randomized trials
Frans B Waldorff (1*), Volkert Siersma (1), Ruth Ertmann (2), Marius Brostrøm Kousgaard (1), Anette Sonne Nielsen (4), Peter Felding (3), Niels Mosbæk (3), Else Hjortsø (4) and Susanne Reventlow (1)

* Correspondence: fransw@sund.ku.dk
(1) The Research Unit for General Practice and Section of General Practice, Department of Public Health, University of Copenhagen, Copenhagen, Denmark. Full list of author information is available at the end of the article.

Waldorff et al. Implementation Science 2011, 6:79, http://www.implementationscience.com/content/6/1/79. © 2011 Waldorff et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Point-of-care testing (POCT) is increasingly being used in general practice to assist general practitioners (GPs) in their management of patients with diseases. However, low adherence to quality guidelines in terms of split test procedures has been observed among GPs in parts of the Capital Region in Denmark. Computer reminders embedded in GPs' electronic medical records (ComRem) may facilitate improved quality control behaviour, but more research is needed to identify what types of reminders work and when. The overall aim of this study is to evaluate the efficacy of ComRem in improving GPs' adherence to quality guidelines. This article describes the rationale and methods of the study that constitute this research project.

Methods/design: The study is conducted as two randomised controlled trials (RCTs) among general practices in two districts of the Capital Region in Denmark. These districts contain a total of 739 GPs in 567 practices with a total of 1.1 million patients allocated to practice lists. In the first RCT (RCT A), ComRem is compared to postal reminder letters. In the second RCT (RCT B), ComRem is compared to usual activities (no reminders) with a crossover approach. In both of these studies, outcomes are measured by the number of split tests received by the laboratory.

Conclusions: This study will contribute to knowledge on the efficacy of ComRem in primary care. Because the study does not explore GPs' perceptions of and experiences with ComRem, we will subsequently conduct a qualitative survey focusing on these aspects.

Trial registrations: Study A: ClinicalTrials.gov identifier NCT01152151. Study B: ClinicalTrials.gov identifier NCT01152177.

Background

Point-of-care testing (POCT) is increasingly being used in general practice to assist general practitioners (GPs) in their daily work with patients. For adequate deployment of POCT, an external quality assessment (EQA) is recommended on a monthly basis [1]. In the Copenhagen area, EQA is enforced by a split test procedure as well as annual outreach consultant visits. In a split test, the actual POCT result is compared to a result from a blood sample from the same individual analyzed at the central laboratory. The quotient of these two results should ideally be 1.00, but a value inside the interval ranging from 0.85 to 1.15 is acceptable [1]. This quotient is returned to the practice for self-evaluation.

However, adherence to the monthly split test procedure has not been satisfactory among GPs in two districts of the Capital Region (Table 1). Therefore, the Copenhagen General Practitioners' Laboratory (hereafter simply referred to as 'the laboratory') planned to improve adherence. Dissemination of guidelines alone rarely brings about improvements in clinical practice [2], and even a multifaceted implementation of guidelines may not change clinical [...]
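As a small worked illustration of the split-test acceptance rule quoted in the Background above (the function name and the example values are our own, not from the protocol):

# Minimal sketch of the acceptance check for a split test: the quotient of the
# POCT result to the central laboratory result should ideally be 1.00, and
# values inside [0.85, 1.15] count as acceptable.
def split_test_ok(poct_result, lab_result, low=0.85, high=1.15):
    quotient = poct_result / lab_result
    return low <= quotient <= high

print(split_test_ok(7.2, 6.8))   # quotient ~ 1.06 -> acceptable (True)
print(split_test_ok(9.0, 6.8))   # quotient ~ 1.32 -> not acceptable (False)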