
QuantNet – A Database-Driven Online Repository of Scientific Information


QuantNet
A Database-Driven Online Repository of Scientific Information

A Master's Thesis Presented by Anton Andriyashin (188779) to Prof. Dr. Wolfgang Härdle

CASE Center of Applied Statistics and Economics
Humboldt University, Berlin

in partial fulfillment of the requirements for the degree of Master of Science

Berlin, June 20, 2007

Declaration of Authorship

I hereby confirm that I have authored this Master's thesis independently and without use of resources other than those indicated. All passages that are taken literally or in general substance from publications or other resources are marked as such.

Anton Andriyashin
Berlin, June 20, 2007

Contents

1 Introduction
1.1 Motivation
1.2 QuantNet: A Look Inside
1.3 An Online Repository of Information
1.4 What Is Wrong With Regular HTML Publishing?

2 Single Document Setup
2.1 Typical Structure of a Submitted ASCII File
2.2 What is XML?
2.3 XML and XSLT: A Single Document in HTML
2.4 ASCII to XML: Atox and XSLT

3 Multiple Documents Setup
3.1 From a Single Document to Multiple Documents
3.2 mySQL and PHP
3.3 Javascript, CSS and PHP
3.4 Putting Everything Together

4 Possibilities of QuantNet's Core
4.1 Scalability: User-defined Tags
4.2 Ease of Administration
4.3 Ways to Make QuantNet Even More Powerful
4.4 Concluding Remarks

1 Introduction

1.1 Motivation

Many sociologists consider the 21st century to be the true beginning of the new information era. Given the amount of information generated every day and the share of that information represented on the World Wide Web, it is no exaggeration to say that online media have become at least as important as their paper-based counterparts. Already in the 1980s the OECD recognized the importance of information as an asset in the global economy [10; 11] and has been using Porat's definition of an information economy [12]: an economy in which at least 50% of the GNP is produced in the so-called primary or secondary information sectors, i.e. sectors that employ information goods and services directly in production, distribution or information processing, or information services produced for internal consumption by companies that do not produce information for sale, and by government [7]. Nowadays information is becoming one of the most valuable assets in the world economy.

New information technologies are able to broaden horizons and tackle traditional challenges in unexpected ways; consider, for instance, the Hypertext Markup Language (HTML). Its first published specification was drafted by Berners-Lee with Dan Connolly and published in 1993 by the IETF [5], and already in 2000 HTML became an international standard (ISO/IEC 15445:2000). This language offered a new way of navigating content: so-called hyperlinks make it possible to switch quickly between author-defined parts of a document.
Having realized the many advantages of modern IT, many "offline" journals and magazines established an online presence with unique ways of delivering information that paper-based editions lack. High-resolution photographic materials, video content, audio podcasts, quick search, archives and links to other resources are just a few of the comme il faut elements that almost any high-end online edition has to offer today. In the last 15 years HTML became the cornerstone of the whole Internet, which changed the lifestyle of the majority of people on the planet.

And HTML is just one example. Newer markup languages like the Extensible Markup Language (XML) and its derivatives (e.g. VoiceXML) structure only data and do not contain direct style instructions (as opposed to, for instance, font color and many other similar properties in HTML). With Extensible Stylesheet Language Transformations (XSLT), an XML document can be transformed into another XML document, ASCII, HTML or even a PDF file. Therefore, it is not for nothing that many web sites employ a data-driven approach to construction and are able to provide automatic updates based on new incoming information in real time. Just imagine an online weather forecast service that receives remote data (in XML format) from a research center and is able to update the site with almost no delay. If HTML were used instead of XML and XSLT, the update would take much longer because of the manual corrections required inside the HTML code. And since XML is now supported by many prominent software applications, e.g. Oracle, Informix, mySQL or Microsoft Office, not to mention the support of XML by many programming languages like PHP, Python and Perl, the potential for employing XML on the Internet is enormous.

A substantial portion of the scientific knowledge produced is later presented online to share new ideas with colleagues all over the world.
The last decade clearly showed the importance of Internet presence, resulting in the emergence of different citation index systems like ISI, Scopus, CiteSeer, RePEc, Google Scholar and others. But while the case for online presence is quite clear, it is not always clear how to present that information, since substantial technical difficulties can arise while establishing one's own online content system, e.g. in the framework of a research institution.

The aim of this work is to provide a semi-automated core called QuantNet that allows significant amounts of scientific information to be published online in a situation where regular updates are expected and, most importantly, where the authors of submitted materials are not assumed to know any markup language, i.e. the materials can be submitted as ASCII files with the simplest structure.

It is not the ultimate goal of this study to provide a ready-to-ship commercial web application. Instead, the implementation of the core and the examination of its possibilities and limitations are of particular interest. At the same time, only minor effort should later be needed to deploy a full-scale online system based on the created core.

1.2 QuantNet: A Look Inside

Consider a hypothetical example of a project or a procedure that is about to be submitted online via QuantNet. A typical ASCII file could look as follows:

    @Area SFM
    @Name Autocorrelation Plots
    @Function_call SFMacfar2()
    @Description Plots the autocorrelation function of AR(2) process
    @Revision 1.2
    @Author Christian Hafner, 2007-01-06

    lag = 30 ; lag value
    a1 = 0.5 ; value of alpha_1
    a2 = 0.4 ; value of alpha_2
    input = readvalue("alpha1"|"alpha2"|"lag", 0.5|0.4|30)

The first part of the file contains some general information about the project like its name, author and so on, while the second part may contain a detailed description and/or computer code.
As can be seen from the sample file, it does not contain any language-specific markup tags. The only tags employed are natural field descriptors, preceded by the @ symbol. The author of the submitted document does not have to care about auxiliary properties like font size, color, family and so on. The only requirement is to follow the sample structure.

But what is the next step? How is this ASCII file to be transformed into a well-formed HTML file that will ultimately be rendered by the client's browser? Several steps have to be undertaken, but the crucial one is to transform the data from the ASCII file into an XML file that could look as follows:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <quantlet>
      <name>Autocorrelation Plots</name>
      <area>SFM</area>
      <function_call>SFMacfar2()</function_call>
      <desc>
        Plots the autocorrelation function of AR(2) process
      </desc>
      <rev>1.2</rev>
      <author>Christian Hafner, 20070106</author>
    </quantlet>

At the same time, advanced users of QuantNet are supposed to profit from the full range of possibilities offered by native HTML, XML and XSLT, so inline tags, if present, should be processed adequately. For instance, if the <bold> tag is allowed in the ASCII file and stands for its HTML counterpart <b>, then QuantNet should be able to process the following ASCII file adequately:

    @Area SFM
    @Name Autocorrelation Plots
    @Function_call SFMacfar2()
    @Description Plots the <bold>autocorrelation function</bold>
    of AR(2) process
    @Revision 1.2
    @Author Christian Hafner, 2007-01-06

    lag = 30 ; lag value
    a1 = 0.5 ; value of alpha_1
    a2 = 0.4 ; value of alpha_2
    input = readvalue("alpha1"|"alpha2"|"lag", 0.5|0.4|30)

The file is screened for the <bold> tag, which is substituted with the valid HTML <b> tag for bold text style:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <quantlet>
      <name>Autocorrelation Plots</name>
      <area>SFM</area>
      <function_call>SFMacfar2()</function_call>
      <desc>
        Plots the <b>autocorrelation function</b> of AR(2) process
      </desc>
      <rev>1.2</rev>
      <author>Christian Hafner, 20070106</author>
    </quantlet>

And, of course, extra tags should by no means be limited to the markup group only. In principle even MathML, when supported by the browser (e.g. Mozilla Firefox), should be displayed adequately inside, say, the <math> tag. That can be very handy for documents containing a lot of formulas. QuantNet is supposed to deliver such a degree of scalability that almost any HTML tag or combination of tags could later be defined as simpler and more user-friendly tags allowed in input ASCII files.

There exist several solutions that may be helpful in this field, e.g. the lightweight markup language Textile, which converts simple ASCII files into well-formed XHTML and allows some formatting variations [4], or AsciiDoc, aimed at writing short documents in ASCII to be converted to HTML [1]. These tools can be good at solving some specific web-oriented tasks but are not sufficient to build a complete and scalable content system like QuantNet.

In this work the representation part of the content is left solely to XSLT, while string manipulation accounts only for the preparation of the necessary raw data files in XML. The logic of Textile, for instance, is employed exactly at this stage, while creating XML files out of submitted ASCII files, but with one key difference: no style options that would later appear inside HTML code are considered at this stage.

Here is a short overview of the study. In the first part of the work, content system administration issues are considered. Section 1.3 analyzes a setup typical for a research institution with multiple departments and no common IT system.
Section 1.4 considers the challenge from the technical point of view, namely through the nature of the HTML language.

Part 2 focuses on the implementation issues in the single document framework. Section 2.1 introduces the structure of ASCII files recommended by QuantNet. XML is presented in Section 2.2 in the context of a dynamic weather forecasting web application. Section 2.3 focuses on the conjunction of XSLT and XML and discusses the issues that implementing a multiple document web application would raise if HTML were the only language employed.

Part 3 primarily concerns the multiple document nature of QuantNet. Section 3.1 provides a short overview of the implementation tools necessary to deploy a web application of this type. Section 3.2 introduces mySQL, a popular database management system for online applications, and PHP, a scripting language that is mostly used on the server side. Later, in Section 3.3, the motivation for Javascript as a client-side scripting language in conjunction with PHP and CSS is provided. A step-by-step overview of the implementation of QuantNet, available in Section 3.4, concludes Part 3.

Finally, Part 4 focuses on the potential of QuantNet as a scalable web application. Section 4.1 concentrates on the ability of QuantNet to handle a potentially unlimited number of additional tags in ASCII files. Several useful applications of this feature as well as the implementation logic are provided there. In Section 4.2 the process of adding a new project to QuantNet or changing the application's content structure is considered. Validation by means of XML schemas and analytic grammar, based on Backus-Naur form, as well as scripting are discussed in Section 4.3.
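
The ASCII-to-XML step described in Section 1.2 can be illustrated with a minimal sketch. The thesis performs this step with Atox and an XSLT template; the code below swaps in plain Python string handling purely for illustration. The field-to-element mapping, the inline tag table and all function names are assumptions modelled on the sample listings, not part of QuantNet itself.

```python
import re
import xml.etree.ElementTree as ET

# Assumed mapping from @-field names to XML element names,
# modelled on the sample listings in Section 1.2.
FIELD_TO_ELEMENT = {
    "Name": "name",
    "Area": "area",
    "Function_call": "function_call",
    "Description": "desc",
    "Revision": "rev",
    "Author": "author",
}

# Assumed table of user-friendly inline tags and their HTML counterparts.
INLINE_TAGS = {"bold": "b"}


def parse_header(text):
    """Collect the @-fields that precede the first blank line."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line separates the header from the body
        if line.startswith("@"):
            current, _, value = line[1:].partition(" ")
            fields[current] = value.strip()
        elif current is not None:
            # a line without '@' continues the previous field
            fields[current] += " " + line.strip()
    return fields


def substitute_inline(value):
    """Replace user-friendly tags such as <bold> with their HTML counterparts."""
    for user_tag, html_tag in INLINE_TAGS.items():
        pattern = r"<(/?)" + re.escape(user_tag) + r">"
        value = re.sub(
            pattern,
            lambda m, t=html_tag: "<" + m.group(1) + t + ">",
            value,
        )
    return value


def to_xml(fields):
    """Build a <quantlet> document; values must be well-formed XML fragments."""
    root = ET.Element("quantlet")
    for field, element in FIELD_TO_ELEMENT.items():
        if field in fields:
            inner = substitute_inline(fields[field])
            # Parsing the value turns inline tags into real child elements.
            root.append(ET.fromstring("<%s>%s</%s>" % (element, inner, element)))
    return ET.tostring(root, encoding="unicode")


SAMPLE = """@Area SFM
@Name Autocorrelation Plots
@Function_call SFMacfar2()
@Description Plots the <bold>autocorrelation function</bold>
of AR(2) process
@Revision 1.2
@Author Christian Hafner, 2007-01-06

lag = 30 ; lag value
"""

if __name__ == "__main__":
    print(to_xml(parse_header(SAMPLE)))
```

Keeping the inline tag table separate from the parsing logic reflects the scalability goal stated above: allowing a new user-friendly tag would only require a new table entry. The sketch does not escape characters such as & or a bare < inside field values, so it would reject input that is not a well-formed XML fragment.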
[...]

1.3 An Online Repository of Information

A typical application of QuantNet could be an online interdisciplinary repository of research materials submitted by various parties, from professional researchers to university students. These materials could contain not only results and algorithm descriptions, which is a traditional form of almost any publication, but also source codes, when available, as well...

... employ for a certain heading unless he or she is well aware of specific HTML tags to take advantage of. In this sense the submitted ASCII files normally are to contain only data and a minimum amount of (or no) markup tags. This is the fundamental feature of QuantNet: a user supplies a structured data file, and QuantNet semi-automatically processes this file and incorporates it in the proper cell of the system:...

... may be clear what advantages the submission of a research study as a data ASCII file provides for a person who is not aware of HTML for online publication, several administration aspects, which may be not so obvious, are worth mentioning here. Suppose the author of the project to be published online has the file in HTML format. Does that automatically mean that this person is aware of HTML? Not necessarily...

... functioning of QuantNet as a complex web application. In this section let us review the major stages of QuantNet. As was mentioned before, the primary aim of QuantNet is to provide a way of easy web publishing even for those people who are not so familiar with the necessary technical details. For those who are well aware of HTML and other markup languages, QuantNet should be able to provide additional optional... controls via extra available tags to be used in the submitted ASCII files. With a potentially large number of documents, QuantNet should be easy to maintain as a web application; it should have a modular structure and be scalable. And, of course, it should be a modern application with suitable graphic elements of the user interface. All these prerequisites led to the choice of a particular tool to tackle...

... constituted by a fixed set of allowed tags that are in charge of proper representation of text and graphics on the web page. An XML file can contain any user-defined tag; in fact, there are no predefined tags at all. And how is that possible for a language? Since the aim of XML is just to structure the information and not to display it, this approach is quite natural because it is impossible to make predefined templates...

... The names of these documents (XML files) are stored in a simple mySQL database along with other suitable properties like project affiliation in terms of the area and so on; refer to Table 3 for details. The database not only provides the administrator with an easy way to maintain the logical structure of QuantNet, but is also an important part of the creation of the dynamic navigation menu and possibly...

2.2 What is XML?

XML, the Extensible Markup Language, is a markup language with user-defined tags used for information management. While HTML is another markup language and XML files are used to create HTML output, there are several noticeable differences between these two languages. First and foremost,...

Figure 2: BBC Weather web page: generation of dynamic HTML out of raw XML files with weather forecast data

... has no parents (usually area names)
For internal maintaining purposes only

Table 3: Fields employed by QuantNet in the mySQL database

If the projects submitted to QuantNet are maintained as a hierarchical structure, i.e. each project belongs to a specific area (refer to the @area tag in Section 2.4), this setup can be easily replicated in a quite simple database. From there it is possible to build a...

... challenges. At the very first step, submitted ASCII files are processed by Atox and later post-processed by an additional XSLT template. Available field names in ASCII files, extra allowed tags for user formatting and end-of-field character(s) are the three most important factors influencing the work of Atox. As a result of these operations, each ASCII file is translated into a well-formed XML document. The names ...
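
The area/project hierarchy mentioned in the excerpts above could be mirrored in a single table with a parent reference, from which a navigation menu can be derived. The following is only a sketch under stated assumptions: the actual fields of Table 3 are not reproduced in this preview, so the table name, the column names and the file name SFMacfar2.xml are hypothetical, and Python's built-in sqlite3 module stands in for mySQL so that the example is self-contained.

```python
import sqlite3


def build_menu(conn):
    """Return {area name: [project names]} from a parent/child table."""
    menu = {}
    # Entries without a parent are treated as areas (top-level menu items).
    areas = conn.execute(
        "SELECT id, name FROM entries WHERE parent_id IS NULL ORDER BY name"
    ).fetchall()
    for area_id, area_name in areas:
        rows = conn.execute(
            "SELECT name FROM entries WHERE parent_id = ? ORDER BY name",
            (area_id,),
        ).fetchall()
        menu[area_name] = [name for (name,) in rows]
    return menu


# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entries (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES entries(id),  -- NULL for areas
        xml_file  TEXT                             -- NULL for areas
    )
    """
)
conn.executemany(
    "INSERT INTO entries (id, name, parent_id, xml_file) VALUES (?, ?, ?, ?)",
    [
        (1, "SFM", None, None),
        (2, "Autocorrelation Plots", 1, "SFMacfar2.xml"),  # hypothetical file name
    ],
)

menu = build_menu(conn)

if __name__ == "__main__":
    for area, projects in menu.items():
        print(area, projects)
```

A single self-referencing table keeps the structure easy to administer: moving a project to another area amounts to changing one parent reference.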

Date posted: 07/03/2014, 23:20
