The Semantic Web (SW) provides opportunities for new ways of retrieving,
managing and exchanging information on the Web. The aim is to produce
documents for automatic use, not only for human reading as on the current
Web. The SW requires the annotation of documents with ontology-based
semantics.
The current expectation is that users will annotate their own documents
manually. However, Web users are very unlikely to annotate their own
documents and, even if they did, the quality of their annotation could be
low in the average case. Moreover, there is a concrete risk that
professional spamming companies will undermine the usability of the SW
with devious annotations.
In this talk, I will propose producing automatic semantic annotation
engines (SAEs). SAEs work in a way similar to today's search engines and
allow annotating and retrieving information in large repositories for SW
uses. If successful, SAEs will avoid the bottleneck of human-centred
annotation and the dangers of low-quality or devious annotation. Our
methodology is based on exploiting the redundancy of information in large
repositories in order to train information extraction systems in an
unsupervised way. Learning is seeded by integrating
information from structured sources (e.g. databases and digital
libraries). Retrieved information is then used to bootstrap learning for
simple Information Extraction (IE) methodologies (e.g. wrappers), which in
turn produce more annotation to train more complex IE engines.
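The bootstrapping loop described above can be illustrated with a minimal
sketch. It is not the speaker's system: the function names, the fixed-width
character contexts used as "wrappers", the thresholds and the toy pages are
all assumptions made for illustration, and the final stage (training more
complex IE engines) is not modelled.

```python
import re
from collections import Counter

def learn_wrappers(pages, known, width=4):
    """Count the (left, right) character contexts around known strings."""
    contexts = Counter()
    for page in pages:
        for item in known:
            for m in re.finditer(re.escape(item), page):
                left = page[max(0, m.start() - width):m.start()]
                right = page[m.end():m.end() + width]
                # Keep only full-width contexts so patterns stay specific.
                if len(left) == width and len(right) == width:
                    contexts[(left, right)] += 1
    return contexts

def apply_wrappers(pages, contexts, min_support=2):
    """Extract strings enclosed by contexts seen at least min_support times."""
    found = Counter()
    for (left, right), support in contexts.items():
        if support < min_support:
            continue
        pattern = re.compile(re.escape(left) + r"(.+?)" + re.escape(right))
        for page in pages:
            for m in pattern.finditer(page):
                found[m.group(1)] += 1
    return found

def bootstrap(pages, seeds, rounds=2, min_redundancy=2):
    """Grow the seed set, accepting only redundantly extracted candidates."""
    known = set(seeds)
    for _ in range(rounds):
        contexts = learn_wrappers(pages, known)
        candidates = apply_wrappers(pages, contexts)
        new = {c for c, n in candidates.items() if n >= min_redundancy} - known
        if not new:
            break
        known |= new
    return known

# Hypothetical toy data: seeds stand in for a structured source (e.g. a
# database), pages stand in for a large repository.
pages = [
    "<li>Ann Lee</li><li>Bob Ray</li><li>Cy Dunn</li>",
    "<li>Bob Ray</li><li>Cy Dunn</li><li>Dee Fox</li>",
]
seeds = {"Ann Lee", "Bob Ray"}
result = bootstrap(pages, seeds)
print(sorted(result))
```

In this toy run the seeds yield one list-item context, which then extracts
"Cy Dunn" on both pages; "Dee Fox" is extracted only once and is therefore
not accepted, which is the role redundancy plays in filtering candidates.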

Ciravegna is Full Professor of Computer Science at the University of
Sheffield. He has been active in information extraction from texts (IE)
since 1988. He is Coordinator and Principal Investigator of Dot.Kom, an EU
funded project on the use of adaptive IE for Knowledge Management. He is
also co-investigator and technical manager for Sheffield of the EPSRC-funded
Before joining Sheffield in 2000,
he coordinated IE activity at the Fiat Research Centre, Turin, Italy
(1991-1993) and ITC-Irst, Trento, Italy (1995-2000).
He has designed and coordinated the development of a number of systems,
in particular: (i) Sintesi, an industrial system for IE from car failure
reports developed at Centro Ricerche Fiat and one of the pioneering pieces
of work in IE in Europe; (ii) Pinocchio, a system for classical IE
requiring reduced skills in NLP for covering new scenarios, applications
and languages; (iii) LearningPinocchio, a system for wrapper-like adaptive
IE that obtained excellent experimental results with respect to the state
of the art and has become one of the first industrial systems for adaptive
IE in the world; (iv) Amilcare, an adaptive IE system specifically designed
for knowledge management, currently distributed to about 50 sites;
(v) Melita, a tool for user-centred document annotation; and
(vi) Armadillo, a system for mining the Web using adaptive IE.