The Corpora Management System
Based onJava and Oracle Technologies
Serge Yablonsky
Petersburg Transport University, Computer Department, Moscow ay., 9,
St Petersburg, 190031, Russia
Russicon Company, Kazanskaya str., 56, ap.2, St Petersburg, 190000, Russia
serge yablonsky@hotmail.com
; root@russicon.spb.su;
http://www.russicon.ru
Abstract
The paper discusses the corpora manage-
ment system (CMS) design that uses Java
and Oracle9i DBMS to support strategic
corpora analysis. We present the pilot web-
based CMS to support linguists in their
daily work. The system offers facilities to
assist linguists and internet users as they
search for relevant material, and then clas-
sify and annotate this material.
1 Introduction
There's a wide class of documental management
solutions and products that fall under the rubric
"corpora and text mining". They are similar to data
mining solutions in that they deal with large vol-
umes of data, but the difference between the two
technology solutions is that while data mining ex-
tracts, analyzes, and summarizes numerical, struc-
tured data, text mining handles large volumes of
unstructured, text-based data. Document systems
with large-scale linguistic annotation are used by a
wide range of research and commercial applica-
tions.
This paper presents a web-based text corpora de-
velopment system (CMS) that focuses on the de-
velopment of UML-specifications, architecture and
actual implementations of DBMS tools to support
strategic corpora analysis.
We present the basic features of a prototype cor-
pora managementsystem under development in-
tended to support linguists in their daily work. The
system offers facilities to assist linguists and inter-
net users as they search for relevant material, and
then classify and annotate this material in a reposi-
tory.
The CMS is implemented using Javaand commer-
cial DBMS Oracle9i.
2 System Overview
The Corporamanagementsystem combines Java,
XML, XSL, HTML, and Oracle9i components
(Yablonsky S.A., 2002). The system was by adapt-
ing existing and new DBMS andJava tools to the
necessities of the intended task for the Russian
language (Yablonsky S.A., 2000).
CMS consists of such main parts (see figure 1):
Corpora — files in more then 150 different for-
mats (doc, rtf, pdf, htm, xml, etc.);
Annotated corpora — files in XML format us-
ing XML Corpus Encoding Standard (XCES,
http://www.xml-ces.org
— Ide, N. & Brew, C.,
2000, Ide, N., Romary L., 2001) and text for-
mats;
Oracle 9i DBMS Enterprise Sever (Release 2)
with such main counterparts:
o
Grammatical dictionaries
(inflec-
tion paradigms of the given lan-
guage) for the languages that are
not supported by Oracle Text;
o
Ontologies / WordNet
(Fellbaum
C.) /
Domain Thesauruses
for
given languages;
o
Word Index
(index of all entry
words or lemmas of Corpora);
179
Search Engine
Concept Clustering (Ontology
Thesaurus)
Document Management System
Graphical Display and Navigation
DBMS Oracle 9iR2
( C-cirp
-Cr—
a
(PC,
Internet/intrar
- Ontology database
-
Grammatical Dictionary
XML, HTML, Doc, PDF
etc.
- Word Index
(150 format
types)
Web-server (Apache 2.x + Tomcat 4.1.x)
Annotated Corpora
XML
41101rn
et/I ntraneNP
Browser
Browser
Browser
Figure
1. CMS
structure
O
Oracle Text.
Web Server (for example, Apache 2.x plus
Tomcat 4.1.x) with such Java Server Pages
(JSP) counterparts:
O
Search Engine;
o
Concept Clustering;
o
Document Management System;
o
Interface Subsystem.
We present the fragment of CMS UML-
specification built by Rational Rose (Rational Rose
Enterprise Edition Documentation, 2001) that
could be expanded in future by community of lan-
guage and speech resources developers (see figure
2).
Here, for example, the relational table DOCS con-
tains such attributes:
DOCS_NAME — document name;
DOCS_AUTHOR — document author;
DOCS_HTTP — document path;
DOCS_LANG — document language;
DOCS_LANG_CODE — document coding;
DOCS_DATE — date of including in the Cor-
pora;
DOCS_EXTENSION — document file exten-
sion.
The set of different types of UML-specifications
brings us to full three-level system, including user,
business and data services.
180
DOC S_NO = DOCS_NO
<<R
elat
onaqalJe>>
WORD_INDEX
S NO P<
_VVOR D VAR C H AR 2
#W_N
EVV NUMBER
IZ•V\I_VVORD_PK
=
VVI_WORD
<<F K>>
RESULTS D
<<FE>>
RESULTS QUERIES
PC
ORY_NO = QRY_NO
«Relalionakiatle»
R ESULTS
#QRY_N
0 NUMBER
•
D OC S_N 0 NUMBER
R
ES PAGE NO NU
VEER
DOCS_NO_P1
,
( = DOC S_N 0
OD
0 CS_
#00
CS_
#12)0
CS_
#00
CS_
#RUBRI
#00
CS
#D
0
CS_
LAN G_
CODE
:
VAR CHAR
2
#11)
0
CS_
DATE
:
DATE
t
OCS_EXTEN
SION
:
VAR
CH
AR 2
0 CS_ NANE_ PK
=
D 0 CS_ NAME
OCS_N0
_IN D
=
DOCS_N 0
NUMBER
NAIvE VARC HAP
2
AU TH OR VAR CH AR 2
HTTP VAR CHAR 2
CS_N 0 NUMBER
LANG VAR CHAR 2
<<R ationaiTabl e>>
RES_ADV_SEAR
#DOCS_N 0 NUMBER
A
r#SEARCH_ID
= SEAR
C H_E)
<<RelationalTable>>
RU BR
ICS
.1.RU
BR ICS_N 0
:
NUMBER
RU BR ICS_PAR ENT
:
NUMBER
#RU
BR I
CS_
N AME
:
VAR
CH
AR
2
t
U
BR ICS_DESC
:
VAR C HAP
2
U BR ICS_N AME_PK
= RUBRICS_NAME
UBRICS_N O_IND = RUBRICS_NO
<<R
narraNe>>
QUERIES
#QP
Y_N 0 NUMBER
#QP
Y_QU
ER
Y
VAR CHAR
2
#OR
Y_R B1PrO NUMBER
#QR
Y_EXTEN
SION
VAR C
HAR 2
#OR
Y_LANG VAR CH AR2
#QR
Y_DATE VAR CHAR
2
#QR
Y_IN CLUDE
1.
4.
QUER ES_ Q UER Y_PE = OR Y_QUERY
4
r#QU
ER ES_N O_IND
=
QR Y
NO
<<R
e
I
ati
anal
Tabl
>>
DO C S
03
<<F1-4>>
DOGS
RUBRICS
N
0_ FK
RUBRICS NO = RUBRICS_NO
REF
ADV
3
ER
DOD
S_NO =
DOD S_NO
Figure 2. Fragment of UML notation of data model
UML specification of business services uses stan-
dard UML notations of standard linguistic annota-
tion andcorpora manipulation procedures.
Reusability of linguistic andcorpora manipulation
business services could be achieved by usage of a
widely accepted set of UML notation standards for
corpus-based work in natural language processing
applications.
3 System Features
The powerful search engine of prototype system
particular uses advantages of Oracle Text search
and services and includes (Oracle9i Database
Documentation):
•
Content-based retrieval
on free text with both
literal (word) predicates and thematic predi-
cates. It includes: a comprehensive range of
operators and index preferences (e.g. Boolean,
exact phrase match, proximity, section search-
ing, fuzzy, stemming, wildcard, thesaurus,
stopwords, case sensitivity, and search scor-
ing), "about" search, structured search, broad
document format support and multi-language
support. For example, texts can be searched for
stems of words e.g "teach" would return
"teaching", "taught" etc. Fuzzy match can be
used if you are not sure of the spelling of a
word. A search can be done to find words
which are close to each other within a word.
Documents can be searched on what a docu-
ment is about as opposed to the existence of
specific words. Gists or Theme Summaries can
be produced which produce a summary of
what a document is about (using themes).
181
•
For XML framework XPath searching enables
sophisticated queries which can reference and
leverage the embedded structure of XML
documents — instead of using a text query to
find documents, you use a document to fmd
queries. XML path searching is able to per-
form sophisticated section searches: doctype
disambiguation, attribute value searching,
automatic section indexing, and more.
In addition to the search capabilities, a number of
other features are provided to simplify application
development.
•
Corpora Format Support.
In order to index
documents stored in a variety of native for-
mats, such as Word, Excel, PowerPoint,
WordPerfect, HTML, and Acrobat/PDF, sys-
tem supplies a broad variety of "filters" that al-
low documents stored in their native formats to
be indexed. Support for more then 150 file for-
mats in order to index files in a large range of
formats including Word, Acrobat, HTML,
WordPerfect, Powerpoint, Excel Flexible Stor-
age Location - documents can be stored and
indexed in the database, in a location pointed
to by a URL or in an external file.
•
Corpora Graphical Display and Navigation.
System services can convert any supported
document format to either plain text or format-
ted text (an HTML approximation retaining as
much as possible of the original formatting;
available for all formats except PDF). Both
plain text and HTML versions may be viewed
in a standard browser.
•
Document Management System.
System sup-
plies an administration tool through which all
major text maintenance and administration
functions may be performed.
•
Concept Clustering
identifies the relationships
between phrases of the texts and it builds a
"lexical network," grouping related phrases
and enhancing the most important features of
these groupings. The resulting patterns reveal
the conceptual backbone of the text collection.
Russian WordNet is used for basic conceptual
grouping.
•
Automatic Corpora Text
collection, tokeniza-
tion, part-of-speech tagging. Reusability of
linguistic resources is achieved by annotation
of texts using a common data model. For that
purpose the XML and related standards such as
XML Corpus Encoding Standard (XCES,
http://www.xml-ces.org ) (Ide,
et al.,
2000) are
used in the system.
CNS is localized for Russian language. Oracle 9i
Text doesn't have Russian language support. In
order to use Oracle Text capabilities we add
Rus-
sian Grammatical Dictionary
(morphosyntactic
dictionary) that consists of word paradigms with
grammatical characteristics of all Corpora word
index (Yablonsky S.A., 1998). It helps to perform
linguistic (paradigm) search in Corporaand is also
used for Corpora text annotation.
System could be easily adapted to different soft-
ware platforms (Java and Oracle) and the necessi-
ties of other languages (Unicode), making the
whole system portable to other platforms with
minimal changes.
The only language-specific resources are a large-
scale moiphosyntactic dictionary plus POS tagger.
References
Ide, N., Romary L., 2001. XML Support for Annotated
Language Resources. In:
Linguistic Exploration,
Workshop on Web-Based Language Documentation
and Description, Dec 12 - Dec 15, 2000, University
of Pennsylvania Philadelphia, Pennsylvania, USA.
Ide, N. & Brew, C., 2000. Requirements, Tools, and
Architectures for Annotated Corpora. In:
Proceed-
ings of the EAGLES/ISLE Workshop on Meta-
Descriptions and Annotation Schemas for Multimo-
dal/ MultimediaLanguage Resources and Data Ar-
chitectures and Software Support for Large Corpora.
Paris: European Language Resources Association.
Oracle9i Database Documentation (Release 9.0.2),
2002.
Rational Rose Enterprise Edition 2001, Documentation.
Fellbaum C. (ed.). WordNet. An Electronic Lexical Da-
tabase. Bradford Books.
Yablonsky S.A., 1998. Russicon Slavonic Language
Resources and Software. In: A. Rubio, N. Gallardo,
Yablonsky S.A., 2000. Russian Monitor Corpora: Com-
position, Linguistic Encoding and Internet Publica-
tion. Proceedings Second International Conference
on Language Resources & Evaluation, Athens,
Greece, 2000.
Yablonsky S.A., 2002. Corpora as Object-Oriented Sys-
tem. From UML-notation to Implementation. Pro-
ceedings Third International Conference on
Language Resources & Evaluation, Las Palmas, Ca-
nary Islands-Spain, 2002.
182
. Monitor Corpora: Com- position, Linguistic Encoding and Internet Publica- tion. Proceedings Second International Conference on Language Resources & Evaluation, Athens, Greece, 2000. Yablonsky. material, and then classify and annotate this material in a reposi- tory. The CMS is implemented using Java and commer- cial DBMS Oracle9 i. 2 System Overview The Corpora management system combines Java, XML,. notation of data model UML specification of business services uses stan- dard UML notations of standard linguistic annota- tion and corpora manipulation procedures. Reusability of linguistic and corpora