QuantNet – A Database-Driven Online
Repository of Scientific Information
A Master’s Thesis Presented
by
Anton Andriyashin
(188779)
to
Prof. Dr. Wolfgang Härdle
CASE – Center of Applied Statistics and Economics
Humboldt University, Berlin
in partial fulfillment of the requirements
for the degree of
Master of Science
Berlin, June 20, 2007
Declaration of Authorship
I hereby confirm that I have authored this Master’s thesis independently and without the use of
resources other than those indicated. All passages taken literally or in substance from
publications or other resources are marked as such.
Anton Andriyashin
Berlin, June 20, 2007
Contents
1 Introduction
1.1 Motivation
1.2 QuantNet: A Look Inside
1.3 An Online Repository of Information
1.4 What Is Wrong With Regular HTML Publishing?
2 Single Document Setup
2.1 Typical Structure of a Submitted ASCII File
2.2 What is XML?
2.3 XML and XSLT – A Single Document in HTML
2.4 ASCII to XML: Atox and XSLT
3 Multiple Documents Setup
3.1 From a Single Document to Multiple Documents
3.2 mySQL and PHP
3.3 Javascript, CSS and PHP
3.4 Putting Everything Together
4 Possibilities of QuantNet’s Core
4.1 Scalability – User-defined Tags
4.2 Ease of Administration
4.3 Ways to Make QuantNet Even More Powerful
4.4 Concluding Remarks
1 Introduction
1.1 Motivation
Many sociologists consider the 21st century to be the true beginning of the new information era.
Given the amount of information generated every day and the share of it that is represented
on the World Wide Web, it is no exaggeration to say that online media have become
at least as important as their paper-based counterparts.
As early as the 1980s the OECD recognized the importance of information as an asset in the global
economy [10; 11]. It has been using Porat's definition of an information economy [12]: one in which
at least 50% of the GNP is produced in the so-called primary or secondary information sectors.
These are sectors that employ information goods and services directly in production, distribution
or information processing, together with information services produced for internal consumption
by government and by companies that do not produce information for sale [7].
Today information is becoming one of the most valuable assets in the world economy.
New information technologies can broaden horizons and tackle traditional challenges in
unexpected ways. Consider, for instance, the Hypertext Markup Language (HTML).
Its first published specification was drafted by Berners-Lee with Dan Connolly and published
in 1993 by the IETF [5], and as early as 2000 HTML became an international standard (ISO/IEC
15445:2000). The language offered a new way of navigating content: the reader can switch quickly
between author-defined parts of a document via so-called hyperlinks. Having realized the many
advantages of modern IT, many "offline" journals and magazines established an online presence with
ways of delivering information that paper-based editions lack. High-resolution photographic
materials, video content, audio podcasts, quick search, archives and links to other resources
are just a few of the comme il faut elements that almost any high-end online edition has to offer
today. In the last 15 years HTML became the cornerstone of the whole Internet, which changed the
way of life of the majority of people on the planet.
And HTML is just one example. Newer markup languages like the Extensible Markup Language
(XML) and its derivatives (e.g. VoiceXML) structure only data and do not contain direct
style instructions (as opposed to, for instance, font color and many similar properties in
HTML). By means of Extensible Stylesheet Language Transformations (XSLT), an XML document can
be transformed into another XML document, an ASCII file, an HTML file or even a PDF file. It is
therefore no accident that many web sites employ a data-driven approach to construction and are
able to provide automatic updates based on new incoming information in real time. Imagine
an online weather forecast service that receives remote data (in XML format) from a research
center and is able to update the site with almost no delay. If HTML were used instead of XML
and XSLT, the update would take much longer because of the manual corrections required inside
the HTML code. And since XML is now supported by many prominent software applications, e.g. Oracle,
Informix, mySQL or Microsoft Office, as well as by many programming
languages like PHP, Python and Perl, the potential for employing XML on the Internet is
enormous.
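The weather example can be illustrated with a minimal sketch. It is written in Python rather than XSLT, and the element names (forecast, day, temperature) and the function render_forecast are assumptions made for illustration only; they do not come from any real forecast feed. The point is merely that the HTML is regenerated from the incoming XML, so no page has to be corrected by hand.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_forecast(xml_text):
    """Build an HTML table from a (hypothetical) forecast XML document."""
    root = ET.fromstring(xml_text)
    rows = []
    for day in root.findall("day"):
        # The date attribute and temperature element are assumed names.
        date = escape(day.get("date", ""))
        temperature = escape(day.findtext("temperature", ""))
        rows.append("<tr><td>{}</td><td>{}</td></tr>".format(date, temperature))
    return "<table>{}</table>".format("".join(rows))


incoming = (
    '<forecast><day date="2007-06-20">'
    "<temperature>24</temperature></day></forecast>"
)
print(render_forecast(incoming))
```

Whenever a new XML file arrives, calling the function again yields the updated page; in the setup discussed in this work that role is played by an XSLT stylesheet.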
A substantial portion of the scientific knowledge produced is later presented online to share new
ideas with colleagues all over the world. The last decade clearly showed the importance of Internet
presence, resulting in the emergence of different citation index systems like ISI, Scopus, CiteSeer,
RePEc, Google Scholar and others. But while the case for online presence is quite clear, it is
not always clear how to present the information, since substantial technical difficulties can arise
while establishing one’s own online content system, e.g. in the framework of a research institution.
The aim of this work is to provide a semi-automated core called QuantNet that allows significant
amounts of scientific information to be published online in a situation where regular updates are
expected and, most importantly, where the authors of submitted materials are not assumed to know
any markup language, i.e. the materials can be submitted as ASCII files with the simplest
structure.
It is not the ultimate goal of this study to provide a ready-to-ship commercial web application.
Instead, the implementation of the core and the examination of its possibilities and limitations
are of particular interest. At the same time, only minor effort should later be needed to deploy a
full-scale online system based on the created core.
1.2 QuantNet: A Look Inside
Consider a hypothetical example of a project or a procedure that is about to be submitted online via
QuantNet. A typical ASCII file could look as follows:
@Area SFM
@Name Autocorrelation Plots
@Function_call SFMacfar2()
@Description Plots the autocorrelation function of AR(2) process
@Revision 1.2
@Author Christian Hafner, 2007-01-06

lag = 30 ; lag value
a1 = 0.5 ; value of alpha_1
a2 = 0.4 ; value of alpha_2
input = readvalue("alpha1"|"alpha2"|"lag", 0.5|0.4|30)

The first part of the file contains general information about the project, such as its name and
author, while the second part may contain a detailed description and/or computer code.
As can be seen from the listing, the ASCII file does not contain any language-specific markup
tags. The only tags employed are natural field descriptors preceded by the @ symbol. The author
of the submitted document does not have to care about auxiliary properties like font size, color,
family and so on. The only requirement is to follow the sample structure.
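To make the structure concrete, here is a minimal sketch of how such a file could be split into header fields and a body. It is a Python illustration only and not the tool used in this work; the function name parse_submission, the lower-casing of field names and the rule that the first blank line ends the header are all assumptions.

```python
import re

# Hypothetical helper: a header line is "@" + field name + whitespace + value.
FIELD_RE = re.compile(r"^@(\w+)\s+(.*)$")


def parse_submission(text):
    """Return (fields, body) for an @-tagged ASCII submission."""
    fields = {}
    body_lines = []
    last_key = None
    in_body = False
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match and not in_body:
            # A new header field such as "@Area SFM".
            last_key = match.group(1).lower()
            fields[last_key] = match.group(2).strip()
        elif not in_body and line.strip() and last_key is not None:
            # Continuation of the previous field, e.g. a long description.
            fields[last_key] += " " + line.strip()
        elif not in_body and not line.strip():
            # Assumption: the first blank line after the header ends it.
            if fields:
                in_body = True
        else:
            body_lines.append(line)
    return fields, "\n".join(body_lines).strip()
```

Applied to the sample file above, the sketch would yield a dictionary with the keys area, name, function_call, description, revision and author, and the code lines as the body.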
But what is the next step? How is this ASCII file to be transformed into a well-formed HTML
file that will ultimately be rendered by the client’s browser? Several steps have to be
taken, but the crucial one is to transform the data from the ASCII file into an XML file
that could look as follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
<quantlet>
  <name>Autocorrelation Plots</name>
  <area>SFM</area>
  <function_call>SFMacfar2()</function_call>
  <desc>
    Plots the autocorrelation function of AR(2) process
  </desc>
  <rev>1.2</rev>
  <author>Christian Hafner, 20070106</author>
</quantlet>
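The step from header fields to the XML listing above can be sketched as follows. This is an illustrative Python stand-in, not the conversion tool of this work; the function fields_to_xml is hypothetical, and the mapping from field names to element names is an assumption inferred from comparing the two listings.

```python
import xml.etree.ElementTree as ET

# Assumed mapping: ASCII field name -> XML element name (from the listings).
FIELD_TO_ELEMENT = {
    "name": "name",
    "area": "area",
    "function_call": "function_call",
    "description": "desc",
    "revision": "rev",
    "author": "author",
}


def fields_to_xml(fields):
    """Serialize a dict of header fields as a <quantlet> element."""
    root = ET.Element("quantlet")
    for field, element in FIELD_TO_ELEMENT.items():
        if field in fields:
            # ElementTree escapes characters like "<" and "&" in the text.
            ET.SubElement(root, element).text = fields[field]
    return ET.tostring(root, encoding="unicode")


print(fields_to_xml({"name": "Autocorrelation Plots", "area": "SFM"}))
```

The sketch omits the XML declaration and pretty-printing; it only shows that the mapping from fields to elements is mechanical once the header has been parsed.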
At the same time, advanced users of QuantNet should be able to profit from the full range
of possibilities offered by native HTML, XML and XSLT, so inline tags, if present, should
be processed properly. For instance, if the <bold> tag is allowed in the ASCII file and
stands for its HTML counterpart <b>, then QuantNet should be able to process the following
ASCII file properly:
@Area SFM
@Name Autocorrelation Plots
@Function_call SFMacfar2()
@Description Plots the <bold>autocorrelation function</bold>
of AR(2) process
@Revision 1.2
@Author Christian Hafner, 2007-01-06

lag = 30 ; lag value
a1 = 0.5 ; value of alpha_1
a2 = 0.4 ; value of alpha_2
input = readvalue("alpha1"|"alpha2"|"lag", 0.5|0.4|30)

The file is screened for the <bold> tag, which is replaced with the valid HTML <b> tag
for bold text style:
<?xml version="1.0" encoding="ISO-8859-1"?>
<quantlet>
  <name>Autocorrelation Plots</name>
  <area>SFM</area>
  <function_call>SFMacfar2()</function_call>
  <desc>
    Plots the <b>autocorrelation function</b> of AR(2) process
  </desc>
  <rev>1.2</rev>
  <author>Christian Hafner, 20070106</author>
</quantlet>
And, of course, extra tags should by no means be limited to text markup. In principle
even MathML, when supported by the browser (e.g. Mozilla Firefox), should be displayed
properly inside, say, the <math> tag. That can be very handy for documents containing many
formulas.
QuantNet is supposed to deliver such a degree of scalability that almost any HTML tag or their
combination could later be defined as simpler and more user-friendly tags allowed for input ASCII
files.
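The idea of user-friendly tags standing for HTML tags can be sketched with a small alias table. This is a Python illustration under stated assumptions: only the pair bold/b is taken from the text, the pair italic/i and the names TAG_ALIASES and expand_aliases are hypothetical, and tags with attributes are deliberately left untouched.

```python
import re

# Alias table: user-friendly tag -> HTML tag. Only "bold" -> "b" comes from
# the text; "italic" -> "i" is an illustrative assumption.
TAG_ALIASES = {"bold": "b", "italic": "i"}

# Matches simple opening and closing tags without attributes.
TAG_RE = re.compile(r"<(/?)(\w+)>")


def expand_aliases(text, aliases=TAG_ALIASES):
    """Replace aliased tags by their HTML counterparts, keep all others."""
    def substitute(match):
        slash, name = match.group(1), match.group(2)
        return "<{}{}>".format(slash, aliases.get(name, name))
    return TAG_RE.sub(substitute, text)


print(expand_aliases("Plots the <bold>autocorrelation function</bold>"))
```

Adding a new user-level tag then amounts to adding one entry to the table, which is the kind of scalability described above; tags that are not in the table, such as <math>, pass through unchanged.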
There exist several solutions that may be helpful in this field, e.g. the lightweight markup language
Textile, which converts simple ASCII files into well-formed XHTML and allows some formatting
variations [4], or AsciiDoc, aimed at writing short documents in ASCII to be converted into HTML [1].
These tools can be good at solving specific web-oriented tasks but are not sufficient to build
a complete and scalable content system like QuantNet.
In this work the presentation of the content is handled solely by XSLT, while string manipulation
serves only to prepare the necessary raw data files in XML. The logic of Textile, for
instance, is employed exactly at this stage, while creating XML files out of submitted ASCII files,
but with one key difference: no style options that would later appear inside the HTML code are
considered at this stage.
Here is a short overview of the study. In the first part of the work, content system administration
issues are considered. Section 1.3 analyzes a setup typical for a research institution with multiple
departments and no common IT system. Section 1.4 considers the challenge from the technical
point of view, namely through the nature of the HTML language.
Part 2 focuses on implementation issues in the single document framework. Section 2.1
introduces the structure of ASCII files recommended by QuantNet. XML is presented in Section 2.2
in the context of a dynamic weather forecasting web application. Section 2.3 focuses on the
conjunction of XSLT and XML and discusses the issues that would arise in implementing a multiple
document web application if HTML were the only language employed.
Part 3 primarily concerns the multiple document nature of QuantNet. Section 3.1 provides
a short overview of the implementation tools necessary to deploy a web application of this type.
Section 3.2 introduces mySQL, a popular database management system for online applications,
and PHP, a scripting language that is mostly used on the server side. Later, in Section 3.3, the
motivation for Javascript as a client-side scripting language in conjunction with PHP and CSS
is provided. A step-by-step overview of the implementation of QuantNet, available in Section 3.4,
concludes Part 3.
Finally, Part 4 focuses on the potential of QuantNet as a scalable web application. Section 4.1
concentrates on the ability of QuantNet to handle a potentially unlimited number of additional tags
in ASCII files. Several useful applications of this feature as well as the implementation logic are
provided there. Section 4.2 considers the process of adding a new project to QuantNet or changing
the application’s content structure. Validation by means of XML schemas and an
analytic grammar based on Backus-Naur form, as well as scripting, are discussed in Section 4.3.
[...]

1.3 An Online Repository of Information
A typical application of QuantNet could be an online interdisciplinary repository of research materials submitted by various parties, from professional researchers to university students. These materials could contain not only results and algorithm descriptions, which is the traditional form of almost any publication, but also source code, when available, as well...

... employ for a certain heading unless he or she is well aware of the specific HTML tags to take advantage of. In this sense the submitted ASCII files are normally to contain only data and a minimum amount of (or no) markup tags. This is the fundamental feature of QuantNet: a user supplies a structured data file, and QuantNet semi-automatically processes this file and incorporates it in the proper cell of the system:...

... may be clear what advantages the submission of a research study as a data ASCII file provides for a person who is not aware of HTML for online publication, several administration aspects, which may be not so obvious, are worth mentioning here. Suppose the author of the project to be published online has the file in HTML format. Does that automatically mean that this person is aware of HTML? Not necessarily...

... functioning of QuantNet as a complex web application. In this section let us review the major stages of QuantNet. As mentioned before, the primary aim of QuantNet is to provide a way of easy web publishing even for those people who are not so familiar with the necessary technical details. For those who are well aware of HTML and other markup languages, QuantNet should be able to provide additional optional... controls via extra available tags to be used in the submitted ASCII files. With a potentially large number of documents, QuantNet should be easy to maintain as a web application; it should have a modular structure and be scalable. And, of course, it should be a modern application with suitable graphic elements of the user interface. All these prerequisites led to the choice of a particular tool to tackle...

... constituted by a fixed set of allowed tags that are in charge of the proper representation of text and graphics on the web page. An XML file can contain any user-defined tag; in fact there are no predefined tags at all. And how is that possible for a language? Since the aim of XML is just to structure the information and not to display it, this approach is quite natural because it is impossible to make predefined templates...

... The names of these documents (XML files) are stored in a simple mySQL database along with other suitable properties like project affiliation in terms of the area and so on; refer to Table 3 for details. The database provides the administrator not only with an easy way to maintain the logical structure of QuantNet, but it is also an important part in the creation of the dynamic navigation menu and possibly...

... 2.2 What is XML?
XML, the Extensible Markup Language, is a markup language with user-defined tags used for information management. While HTML is another markup language and XML files are used to create HTML output, there are several noticeable differences between these two languages. First and foremost,...

Figure 2: BBC Weather web page: generation of dynamic HTML out of raw XML files with weather forecast data

... has no parents (usually area names). For internal maintenance purposes only.

Table 3: Fields employed by QuantNet in the mySQL database

If the projects submitted to QuantNet are maintained as a hierarchical structure, i.e. each project belongs to a specific area (refer to the @area tag in Section 2.4), this setup can be easily replicated in a quite simple database. From there it is possible to build a...

... challenges. At the very first step, submitted ASCII files are processed by Atox and later post-processed by an additional XSLT template. Available field names in ASCII files, extra allowed tags for user formatting and end-of-field character(s) are the three most important factors influencing the work of Atox. As a result of these operations, each ASCII file is translated into a well-formed XML document. The names...
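The area/project hierarchy kept in the database, and the navigation menu derived from it, can be sketched as follows. This is a hedged illustration only: it uses Python with the standard sqlite3 module as a stand-in for mySQL and PHP, and the table name documents, its columns (id, name, parent_id, xml_file), the file name SFMacfar2.xml and the function build_menu are all hypothetical, since the actual fields of Table 3 are not reproduced in this excerpt.

```python
import sqlite3


def build_menu(conn):
    """Return {area name: [project names]} from a parent/child table."""
    menu = {}
    # Rows without a parent are treated as areas (assumed convention).
    areas = conn.execute(
        "SELECT id, name FROM documents WHERE parent_id IS NULL ORDER BY name"
    ).fetchall()
    for area_id, area_name in areas:
        rows = conn.execute(
            "SELECT name FROM documents WHERE parent_id = ? ORDER BY name",
            (area_id,),
        ).fetchall()
        menu[area_name] = [name for (name,) in rows]
    return menu


# In-memory database with one area and one project, for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER, xml_file TEXT)"
)
conn.executemany(
    "INSERT INTO documents (id, name, parent_id, xml_file) VALUES (?, ?, ?, ?)",
    [
        (1, "SFM", None, None),
        (2, "Autocorrelation Plots", 1, "SFMacfar2.xml"),
    ],
)
menu = build_menu(conn)
print(menu)
```

Because the menu is computed from the table on each request, adding a project means inserting one row rather than editing any page by hand.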