Using Text Mining and Explicit Relationships for Enhanced Retrieval and Discovery of Desktop Documents

Type of Thesis: 
Master Thesis

Computer operating systems let users create, organise and store information. Most of this information takes the form of documents organised in the hierarchical folder structure provided by the operating system. To access information, users are limited to navigating this folder structure or to keyword search. Today’s possibilities for accessing documents based on similarity of content or explicit relationships with other documents are limited. One of the main reasons for this limitation is the use of “proprietary” document formats for the user's information. Aside from preventing third-party applications from accessing any structural metadata controlled by the applications “owning” the documents, these formats are tightly packaged: because of their simple linking models, documents exist as independent, self-contained files. Most existing document formats only support links to web resources, neglecting links to documents in other formats. As a step towards better information management, we have built an Open Cross-Document Link Service for Integrating Different Document Formats. With our linking service, users can establish advanced (e.g. bidirectional and multi-directional) hyperlinks between snippets of information in different document formats. Moreover, the linking service is extensible, so it can address the multitude of existing document formats as well as emerging ones.


The goal of this thesis is to improve the way we access, navigate and retrieve the documents stored on our desktop. The student will use text mining algorithms to find similarity of content between different documents. Moreover, explicit hyperlinks created by our linking service should be used to establish new relationships between documents. For example, if document “A” has a link to document “B” and document “B” has a link to document “C”, then document “A” probably has a relationship with document “C”. The student should then use this metadata to enhance the retrieval of and access to the documents.
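
To make the “A”, “B”, “C” example concrete, the following is a minimal Java sketch of how implicit relationships might be derived from explicit links. It is an illustration only, not the linking service's API: the class `LinkInference`, its methods, the `maxHops` limit and the score that halves with every additional hop are all assumptions, and links are treated as directed although the linking service also supports bidirectional and multi-directional links.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that derives implicit relationships from explicit links. */
class LinkInference {

    // Explicit links: document id -> ids of the documents it links to.
    private final Map<String, Set<String>> links = new HashMap<>();

    void addLink(String from, String to) {
        links.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    /**
     * Breadth-first search from {@code start}. Documents first reached after
     * two or more hops are reported as inferred relationships. The score is an
     * assumed weighting: 0.5 for two hops, 0.25 for three hops, and so on.
     */
    Map<String, Double> inferRelated(String start, int maxHops) {
        Map<String, Double> inferred = new LinkedHashMap<>();
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depth.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            int d = depth.get(current);
            if (d >= maxHops) {
                continue; // do not follow links beyond the hop limit
            }
            for (String next : links.getOrDefault(current, Collections.emptySet())) {
                if (depth.containsKey(next)) {
                    continue; // already reached by a path that is at least as short
                }
                depth.put(next, d + 1);
                queue.add(next);
                if (d + 1 >= 2) {
                    // Directly linked documents (one hop) are explicit, not inferred.
                    inferred.put(next, Math.pow(0.5, d));
                }
            }
        }
        return inferred;
    }

    public static void main(String[] args) {
        LinkInference graph = new LinkInference();
        graph.addLink("A", "B");
        graph.addLink("B", "C");
        // "C" is reported as probably related to "A"; "B" is an explicit link.
        System.out.println(graph.inferRelated("A", 2));
    }
}
```

Limiting the number of hops reflects the word “probably” in the example: the longer the chain of links, the weaker the evidence that two documents are actually related.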

Background Knowledge: 
  • Java


Technical challenges: 
  • The student will investigate the state of the art in accessing and retrieving desktop documents as well as relevant text mining algorithms.
  • The student will implement an application that uses a suitable data mining algorithm and the explicit hyperlinks to establish relationships between desktop documents.
  • The student should implement a usable application interface that users can use for navigating and retrieving their desktop documents.
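
As a possible starting point for the content-similarity part of these challenges, the sketch below compares two texts using the cosine similarity of their term-frequency vectors. This is only one of many text mining techniques and is not prescribed by the thesis description; the class `TextSimilarity` and its method names are hypothetical, and a real application would first have to extract plain text from each document format.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: cosine similarity over simple term-frequency vectors. */
class TextSimilarity {

    /** Counts how often each lower-cased word occurs in the text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        // Split on anything that is neither a letter nor a decimal digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Returns a value in [0, 1]; 0 means the texts share no terms. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int count = e.getValue();
            normA += (double) count * count;
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) count * other;
            }
        }
        for (int count : b.values()) {
            normB += (double) count * count;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty document is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> first = termFrequencies("Linking desktop documents");
        Map<String, Integer> second = termFrequencies("Retrieving desktop documents");
        System.out.println(cosine(first, second));
    }
}
```

Raw term counts favour frequent words, so weighting schemes such as TF-IDF, or stop-word removal, are natural refinements for the student to evaluate.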
Ahmed A. O. Tayeh
Academic Year: