Fabio Ciravegna's talk at the 2nd Ontobasis Workshop

Title: Learning to Mine the (Semantic) Web
Approximate length: 50 minutes
Abstract: The Semantic Web provides opportunities for new ways of retrieving, managing and exchanging information on the Web. The aim is to produce documents for automatic use and not only human reading as in the current Web. The SW requires the annotation of documents with ontology-based semantics.

The current expectation is that users will annotate their own documents manually. However, Web users are very unlikely to annotate their own documents, and even if they did, the quality of their annotation could be low in the average case. Moreover, there is a concrete risk that professional spamming companies will undermine the usability of the SW with devious annotations.

In this talk, I will propose producing automatic semantic annotation engines (SAEs). SAEs work in a way similar to today's search engines and allow annotating and retrieving information in large repositories for SW uses. If successful, SAEs will avoid the bottleneck of human-centred annotation and the dangers of low-quality or devious annotation. Our methodology is based on exploiting the redundancy of information in large repositories in order to train information extraction systems in an unsupervised way. Learning is seeded by integrating information from structured sources (e.g. databases and digital libraries). Retrieved information is then used to bootstrap learning for simple Information Extraction (IE) methodologies (e.g. wrappers), which in turn produce more annotation to train more complex IE engines.


About Fabio

Fabio Ciravegna is Full Professor of Computer Science at the University of Sheffield. He has been active in information extraction from texts (IE) since 1988. He is Coordinator and Principal Investigator of Dot.Kom, an EU-funded project on the use of adaptive IE for Knowledge Management. He is also co-investigator and technical manager for Sheffield of the EPSRC-funded IRC AKT.

Before joining Sheffield in 2000, he coordinated IE activity at the Fiat Research Centre, Turin, Italy (1991-1993) and ITC-Irst, Trento, Italy (1995-2000).

He has designed and coordinated the development of a number of systems, in particular:
(i) Sintesi, an industrial system for IE from car failure reports developed at Centro Ricerche Fiat. Sintesi was one of the pioneering pieces of work in IE in Europe [CowieLehnert96].
(ii) Pinocchio, a system for classical IE requiring reduced skills in NLP for covering new scenarios/applications/languages.
(iii) LearningPinocchio, a system for wrapper-like adaptive IE that obtained excellent experimental results with respect to the state of the art and has become one of the first industrial systems for adaptive IE in the world.
(iv) Amilcare, an adaptive IE system specifically designed for knowledge management, currently distributed to about 50 sites.
(v) Melita, a tool for user-centred document annotation.
(vi) Armadillo, a system for mining the Web using adaptive IE.

