Automatic AcquisitionofNamedEntityTaggedCorpusfromWorld Wide
Web
Joohui An
Dept. of CSE
POSTECH
Pohang, Korea 790-784
minnie@postech.ac.kr
Seungwoo Lee
Dept. of CSE
POSTECH
Pohang, Korea 790-784
pinesnow@postech.ac.kr
Gary Geunbae Lee
Dept. of CSE
POSTECH
Pohang, Korea 790-784
gblee@postech.ac.kr
Abstract
In this paper, we present a method that
automatically constructs a Named En-
tity (NE) taggedcorpusfrom the web
to be used for learning ofNamed En-
tity Recognition systems. We use an NE
list and an web search engine to col-
lect web documents which contain the
NE instances. The documents are refined
through sentence separation and text re-
finement procedures and NE instances are
finally tagged with the appropriate NE cat-
egories. Our experiments demonstrates
that the suggested method can acquire
enough NE taggedcorpus equally useful
to the manually tagged one without any
human intervention.
1 Introduction
Current trend in NamedEntity Recognition (NER) is
to apply machine learning approach, which is more
attractive because it is trainable and adaptable, and
subsequently the porting of a machine learning sys-
tem to another domain is much easier than that of a
rule-based one. Various supervised learning meth-
ods for NamedEntity (NE) tasks were successfully
applied and have shown reasonably satisfiable per-
formance.((Zhou and Su, 2002)(Borthwick et al.,
1998)(Sassano and Utsuro, 2000)) However, most
of these systems heavily rely on a taggedcorpus for
training. For a machine learning approach, a large
corpus is required to circumvent the data sparseness
problem, but the dilemma is that the costs required
to annotate a large training corpus are non-trivial.
In this paper, we suggest a method that automati-
cally constructs an NE taggedcorpusfrom the web
to be used for learning of NER systems. We use an
NE list and an web search engine to collect web doc-
uments which contain the NE instances. The doc-
uments are refined through the sentence separation
and text refinement procedures and NE instances are
finally annotated with the appropriate NE categories.
This automatically taggedcorpus may have lower
quality than the manually tagged ones but its size
can be almost infinitely increased without any hu-
man efforts. To verify the usefulness of the con-
structed NE tagged corpus, we apply it to a learn-
ing of NER system and compare the results with the
manually tagged corpus.
2 Automatic Acquisitionof an NE Tagged
Corpus
We only focus on the three major NE categories (i.e.,
person, organization and location) because others
are relatively easier to recognize and these three cat-
egories actually suffer from the shortage of an NE
tagged corpus.
Various linguistic information is already held in
common in written form on the web and its quantity
is recently increasing to an almost unlimited extent.
The web can be regarded as an infinite language re-
source which contains various NE instances with di-
verse contexts. It is the key idea that automatically
marks such NE instances with appropriate category
labels using pre-compiled NE lists. However, there
should be some general and language-specific con-
Web
documents
W1
W2
W3
…
URL1
URL2
URL3
…
Web search
engine
Web robot
Sentence
separator
Text
refinement
S1
S2
S3
…
1.html
2.html
…
1.ans
2.ans
…
NE list
Web page URL
Separated
sentences
Refined
sentences
NE tag
generation
S1(t)
S2(t)
S3(t)
…
NE tagged
corpus
Figure 1: Automatic generation of NE tagged corpus
from the web
siderations in this marking process because of the
word ambiguity and boundary ambiguity of NE in-
stances. To overcome these ambiguities, the auto-
matic generation process of NE taggedcorpus con-
sists of four steps. The process first collects web
documents using a web search engine fed with the
NE entries and secondly segments them into sen-
tences. Next, each sentence is refined and filtered
out by several heuristics. An NE instance in each
sentence is finally tagged with an appropriate NE
category label. Figure 1 explains the entire proce-
dure to automatically generate NE tagged corpus.
2.1 Collecting Web Documents
It is not appropriate for our purpose to randomly col-
lect documents from the web. This is because not all
web documents actually contain some NE instances
and we also do not have the list of all NE instances
occurring in the web documents. We need to col-
lect the web documents which necessarily contain
at least one NE instance and also should know its
category to automatically annotate it. This can be
accomplished by using a web search engine queried
with pre-compiled NE list.
As queries to a search engine, we used the list
of Korean Named Entities composed of 937 per-
son names, 1,000 locations and 1,050 organizations.
Using a Part-of-Speech dictionary, we removed am-
biguous entries which are not proper nouns in other
contexts to reduce errors of automatic annotation.
For example, ‘E¶(kyunggi, Kyunggi/business con-
ditions/a game)’ is filtered out because it means a lo-
cation (proper noun) in one context, but also means
business conditions or a game (common noun) in
other contexts. By submitting the NE entries as
queries to a search engine
1
, we obtained the max-
imum 500 of URL’s for each entry. Then, a web
robot visits the web sites in the URL list and fetches
the corresponding web documents.
2.2 Splitting into Sentences
Features used in the most NER systems can be clas-
sified into two groups according to the distance from
a target NE instance. The one includes internal fea-
tures of NE itself and context features within a small
word window or sentence boundary and the other in-
cludes name alias and co-reference information be-
yond a sentence boundary. In fact, it is not easy to
extract name alias and co-reference information di-
rectly from manually tagged NE corpus and needs
additional knowledge or resources. This leads us to
focus on automatic annotation in sentence level, not
document level. Therefore, in this step, we split the
texts of the collected documents into sentences by
(Shim et al., 2002) and remove sentences without
target NE instances.
2.3 Refining the Web Texts
The collected web documents may include texts ac-
tually matched by mistake, because most web search
engines for Korean use n-gram, especially, bi-gram
matching. This leads us to refine the sentences to ex-
clude these erroneous matches. Sentence refinement
is accomplished by three different processes: sep-
aration of functional words, segmentation of com-
pound nouns, and verification of the usefulness of
the extracted sentences.
An NE is often concatenated with more than one
josa, a Korean functional word, to compose a
Korean word. Therefore we need to separate the
functional words from an NE instance to detect the
boundary of the NE instance and this is achieved
by a part-of-speech tagger, POSTAG, which can
detect unknown words (Lee et al., 2002). The
separation of functional words gives us another
benefit that we can resolve the ambiguities between
an NE and a common noun plus functional words
1
We used Empas (http://www.empas.com)
Person Location Organization
Training
Automatic 29,042 37,480 2,271
Manual 1,014 724 1,338
Test Manual 102 72 193
Table 1: Corpus description (number of NE’s) (Au-
tomatic: Automatically annotated corpus, Manual:
Manually annotated corpus
and filter out erroneous matches. For example,
‘E¶ê(kyunggi-do)’ can be interpreted as
either ‘E¶ê(Kyunggi Province)’ or ‘E¶+ê(a
game also)’ according to its context. We can remove
the sentence containing the latter case.
A josa-separated Korean word can be a com-
pound noun which only contains a target NE as a
substring. This requires us to segment the compound
noun into several correct single nouns to match with
the target NE. If the segmented single nouns are not
matched with a target NE, the sentence can be fil-
tered out. For example, we try to search for an NE
entry, ‘¶Á(Fin.KL, a Korean singer group)’ and
may actually retrieve sentences including ‘˚¶Á
ě(surfing club)’. The compound noun, ‘˚¶Áě’,
can be divided into ‘˚¶(surfing)’ and ‘Áě(club)’
by a compound-noun segmenting method (Yun et
al., 1997). Since both ‘˚¶’ and ‘Áě’ are not
matched with our target NE, ‘¶Á’, we can delete
the sentences. Although a sentence has a correct tar-
get NE, if it does not have context information, it is
not useful as an NE tagged corpus. We also removed
such sentences.
2.4 Generating an NE tagged corpus
The sentences selected by the refining process ex-
plained in previous section are finally annotated with
the NE label. We acquired the NE taggedcorpus in-
cluding 68,793 NE instances through this automatic
annotation process. We can annotate only one NE
instance per sentence but almost infinitely increase
the size of the corpus because the web provides un-
limited data and our process is fully automatic.
3 Experimental Results
3.1 Usefulness of the Automatically Tagged
Corpus
For effectiveness of the learning, both the size and
the accuracy of the training corpus are important.
Training corpus Precision Recall F-measure
Seeds only 84.13 42.91 63.52
Manual 80.21 86.11 83.16
Automatic 81.45 85.41 83.43
Manual + Automatic 82.03 85.94 83.99
Table 2: Performance of the decision list learning
Generally, the accuracy of automatically created NE
tagged corpus is worse than that of hand-made cor-
pus. Therefore, it is important to examine the useful-
ness of our automatically taggedcorpus compared
to the manual corpus. We separately trained the de-
cision list learning features using the automatically
annotated corpus and hand-made one, and compared
the performances. Table 1 shows the details of the
corpus used in our experiments.
2
Through the results in Table 2, we can verify that
the performance with the automatic corpus is supe-
rior to that with only the seeds and comparable to
that with the manual corpus.Moreover, the domain
of the manual training corpus is same with that of
the test corpus, i.e., news and novels, while the do-
main of the automatic corpus is unlimited as in the
web. This indicates that the performance with the
automatic corpus should be regarded as much higher
than that with the manual corpus because the per-
formance generally gets worse when we apply the
learned system to different domains from the trained
ones. Also, the automatic corpus is pretty much self-
contained since the performance does not gain much
though we use both the manual corpus and the auto-
matic corpus for training.
3.2 Size of the Automatically Tagged Corpus
As another experiment, we tried to investigate how
large automatic corpus we should generate to get the
satisfiable performance. We measured the perfor-
mance according to the size of the automatic cor-
pus. We carried out the experiment with the deci-
sion list learning method and the result is shown in
Table 3. Here, 5% actually corresponds to the size of
the manual corpus. When we trained with that size
of the automatic corpus, the performance was very
low compared to the performance of the manual cor-
pus. The reason is that the automatic corpus is com-
2
We used the manual corpus used in Seon et al. (2001) as
training and test data.
Corpus size (words) Precision Recall F-measure
90,000 (5%) 72.43 6.94 39.69
448,000 (25%) 73.17 41.66 57.42
902,000 (50%) 75.32 61.53 68.43
1,370,000 (75%) 78.23 77.19 77.71
1,800,000 (100%) 81.45 85.41 83.43
Table 3: Performance according to the corpus size
Corpus size (words) Precision Recall F-measure
700,000 79.41 81.82 80.62
1,000,000 82.86 85.29 84.08
1,200,000 83.81 86.27 85.04
1,300,000 83.81 86.27 85.04
Table 4: Saturation point of the performance for
‘person’ category
posed of the sentences searched with fewer named
entities and therefore has less lexical and contextual
information than the same size of the manual cor-
pus. However, the automatic generation has a big
merit that the size of the corpus can be increased al-
most infinitely without much cost. From Table 3,
we can see that the performance is improved as the
size of the automatic corpus gets increased. As a
result, the NER system trained with the whole au-
tomatic corpus outperforms the NER system trained
with the manual corpus.
We also conducted an experiment to examine the
saturation point of the performance according to the
size of the automatic corpus. This experiment was
focused on only ‘person’ category and the result is
shown in Table 4. In the case of ‘person’ category,
we can see that the performance does not increase
any more when the corpus size exceeds 1.2 million
words.
4 Conclusions
In this paper, we presented a method that automat-
ically generates an NE taggedcorpus using enor-
mous web documents. We use an internet search en-
gine with an NE list to collect web documents which
may contain the NE instances. The web documents
are segmented into sentences and refined through
sentence separation and text refinement procedures.
The sentences are finally tagged with the NE cat-
egories. We experimentally demonstrated that the
suggested method could acquire enough NE tagged
corpus equally useful to the manual corpus without
any human intervention. In the future, we plan to ap-
ply more sophisticated natural language processing
schemes for automatic generation of more accurate
NE tagged corpus.
Acknowledgements
This research was supported by BK21 program of
Korea Ministry of Education and MOCIE strategic
mid-term funding through ITEP.
References
Andrew Borthwick, John Sterling, Eugene Agichtein,
and Ralph Grishman. 1998. Exploiting Diverse
Knowledge Sources via Maximum Entropy in Named
Entity Recognition. In Proceedings of the Sixth Work-
shop on Very Large Corpora, pages 152–160, New
Brunswick, New Jersey. Association for Computa-
tional Linguistics.
Gary Geunbae Lee, Jeongwon Cha, and Jong-Hyeok
Lee. 2002. Syllable Pattern-based Unknown Mor-
pheme Segmentation and Estimation for Hybrid Part-
Of-Speech Tagging of Korean. Computational Lin-
guistics, 28(1):53–70.
Manabu Sassano and Takehito Utsuro. 2000. Named
Entity Chunking Techniques in Supervised Learning
for Japanese NamedEntity Recognition. In Proceed-
ings of the 18th International Conference on Compu-
tational Linguistics (COLING 2000), pages 705–711,
Germany.
Choong-Nyoung Seon, Youngjoong Ko, Jeong-Seok
Kim, and Jungyun Seo. 2001. NamedEntity Recog-
nition using Machine Learning Methods and Pattern-
Selection Rules. In Proceedings of the Sixth Natural
Language Processing Pacific Rim Symposium, pages
229–236, Tokyo, Japan.
Junhyeok Shim, Dongseok Kim, Jeongwon Cha,
Gary Geunbae Lee, and Jungyun Seo. 2002. Multi-
strategic Integrated Web Document Pre-processing for
Sentence and Word Boundary Detection. Information
Processing and Management, 38(4):509–527.
Bo-Hyun Yun, Min-Jeung Cho, and Hae-Chang Rim.
1997. Segmenting Korean Compound Nouns using
Statistical Information and a Preference Rule. Jour-
nal of Korean Information Science Society, 24(8):900–
909.
GuoDong Zhou and Jian Su. 2002. Named Entity
Recognition using an HMM-based Chunk Tagger. In
Proceedings of the 40th Annual Meeting of the As-
sociation for Computational Linguistics (ACL), pages
473–480, Philadelphia, USA.
. Automatic Acquisition of Named Entity Tagged Corpus from World Wide
Web
Joohui An
Dept. of CSE
POSTECH
Pohang, Korea 790-784
minnie@postech.ac.kr
Seungwoo. of the con-
structed NE tagged corpus, we apply it to a learn-
ing of NER system and compare the results with the
manually tagged corpus.
2 Automatic Acquisition