Extracting Information from the Web

Published on Fri 02 August 2013 in Nepomuk

Google has just sent me a mail saying that I passed the midterm evaluations of the Google Summer of Code! Yay! I want to thank everybody who has helped me with my GSoC project, everyone on the Nepomuk mailing list, and more generally everybody in the KDE community. Everyone is very friendly, and working on KDE is a very enjoyable experience!

As the query parser and the query builder widget come close to being usable (I still have bugs to fix, and things to change in order to make the query parser more localization-friendly), I'm beginning to think about other things I can work on. One of these things is storing in Nepomuk information that comes from the user's web browsing.

The Problem

Nepomuk works really well, and now has a good query parser and a fairly nice query builder widget (I think). You can index your files, meta-data about them is extracted and stored in the database, and you can query this database using Dolphin or KRunner. The problem is that this indexing is currently limited to files.

If you use KMail or any Akonadi-based application, your e-mails and contacts are also stored in Nepomuk, ready to be found using queries like "mails from John". The problem is that not every KDE user uses KMail, as some prefer other mail clients (that don't use Akonadi). In fact, I believe that the majority of KDE users use webmail, and that their instant messaging applications are not KDE ones. Furthermore, many Linux distributions install Firefox as the default web browser. Most of our users therefore don't benefit from the Nepomuk integration that KMail, Kopete and similar applications provide.

Indexing Information from the Web

My idea is to index in Nepomuk information found on the web, but not automatically and not by using a web crawler. The users themselves will provide this information by browsing the web. The idea is to implement a Firefox add-on (and after that a Chrome add-on, as Chrome is also widely used and it seems that Rekonq supports or will support Chrome add-ons) that extracts information such as the following (if the user allows it to do so):

  • The fact that a page has been visited (I just have to check that Nepomuk has an ontology for storing web history)
  • If the page is recognized as belonging to a known webmail service, its DOM structure is explored to find information about the displayed mail or contact
  • When files are downloaded, the URL they were downloaded from can be stored in Nepomuk (there is already a Firefox add-on that does that)

More File Indexers

As Vishesh Handa once explained in a blog post, Nepomuk uses file indexers to extract information about files (line count, video format, information about photographs, etc). Adding a file indexer is very easy: even complex ones are only a couple of hundred lines long. The file indexer gets a file URL, passes it to a library (ffmpeg, taglib, etc), gets information back and stores it in Nepomuk.

For my new experiment of storing web information in Nepomuk, I have to develop three new file indexers:

  • An e-mail indexer. It uses the wonderful KMIME and KMBOX libraries (part of kdepimlibs) to open mbox files and to index the mails they contain. For maildir directories, the e-mails are indexed directly using only KMIME. The indexing consists of storing in Nepomuk the plain text content of the mails and their headers (sender, receiver, etc). Senders and receivers are also used to set the e-mail address of contacts. This is interesting because the office extractors (MS Office and ODF) already create contacts for the authors of your documents. This indexer can therefore add an e-mail address to these contacts.
  • A vCard indexer that uses the KABC library of kdepimlibs to extract detailed contact information from vCard files (name, street address, phone number, etc).
  • An indexer for Nepomuk-specific file formats. For instance, one file format can represent associations between downloaded files and their original URL. Another one can represent a piece of navigation history.

Two of these indexers use the wonderful libraries developed by the kdepimlibs developers. They are rock-solid and implement very complex standards (MIME files are hairy to handle). I want to thank them very much for these libraries that are a pleasure to use:

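// Parse the raw bytes read from the e-mail file.
KMime::Message message;

message.setContent(raw_data_from_file);
message.parse();

// Store the decoded text of the mail as its plain text content.
resource.addProperty(
    NMO::plainTextMessageContent(),
    message.decodedText(true, true)
);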

Bridging the Web and Nepomuk

Why do I use file indexers to index information coming from the web? Because browser extensions are limited. On Linux, Firefox and Chrome are both based on GTK+, and integrating Nepomuk and its Qt event loop would be terribly difficult. Moreover, Chrome does not allow native extensions at all (I'm not sure about that, as my information comes from very early Chrome versions).

Browser extensions can perform HTTP queries, though, but I do not want to develop an HTTP server that would listen for POST requests to create Nepomuk resources. I have already developed an HTTP/1.1 parser (with support for chunked encoding etc), but it is buggy and fairly complex.

The solution lives in files: Firefox and Chrome extensions are both allowed to create local files (it's easy to do in Firefox, less so in Chrome, but possible). So, when a piece of information is extracted from a web page, the extension only needs to format it according to a standard format (MIME, mbox, vCard, etc) or a Nepomuk-specific one (for history and downloaded files), and then to store it in a temporary directory monitored by nepomukfilewatch. nepomukfilewatch is a service that reacts when files are created or modified, so these new files can be detected and indexed. The bridge is in place.
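The formatting step is small. As an illustration, here is a sketch that builds a minimal vCard 3.0 for a contact found on a web page. It is written in C++ to match the rest of this post, although the extension would do the equivalent in JavaScript. `escapeVCardText` and `makeVCard` are hypothetical helpers, and putting the whole name in the first component of `N` is a simplification.

```cpp
#include <string>

// Escapes the characters that vCard 3.0 treats specially in text values.
std::string escapeVCardText(const std::string& value) {
    std::string out;
    for (char c : value) {
        switch (c) {
        case '\\': out += "\\\\"; break;
        case ',':  out += "\\,";  break;
        case ';':  out += "\\;";  break;
        case '\n': out += "\\n";  break;
        default:   out += c;
        }
    }
    return out;
}

// Builds a minimal vCard 3.0 with a display name and one e-mail address.
// Lines end with CRLF, as the format requires.
std::string makeVCard(const std::string& fullName, const std::string& email) {
    const std::string name = escapeVCardText(fullName);
    return "BEGIN:VCARD\r\n"
           "VERSION:3.0\r\n"
           "FN:" + name + "\r\n"
           "N:" + name + ";;;;\r\n"
           "EMAIL;TYPE=INTERNET:" + escapeVCardText(email) + "\r\n"
           "END:VCARD\r\n";
}
```

The resulting text would be written to a file in the monitored directory, where the vCard indexer picks it up.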

Using file indexers also has a nice side effect: users of Thunderbird, Mutt, etc can point nepomukfilewatch to their e-mail directories (it is as simple as checking them in the Nepomuk KCM), and have their e-mails and contacts indexed by the e-mail indexer. Akonadi users don't have to do anything, as Akonadi already stores information in Nepomuk. And as the e-mail indexer skips e-mails already stored in Nepomuk (an mbox file only grows and can contain thousands of mails, so it would be way too slow to re-index them at every run), there will be no conflict if a mail folder is indexed by both the e-mail indexer and Akonadi.
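One possible way to skip mails that are already known is to compare their Message-ID header. The sketch below is an assumption for illustration, not a description of how the branch does it: `messageId` and `IndexedMessages` are hypothetical, the record of known mails is kept in memory, and a real indexer would ask Nepomuk instead.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_set>

// Returns the value of the Message-ID header, or "" if there is none.
// Header names are case-insensitive; folded (multi-line) headers are ignored.
std::string messageId(const std::string& rawMessage) {
    std::istringstream in(rawMessage);
    std::string line;
    const std::string name = "message-id:";

    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) break;  // a blank line ends the header section
        if (line.size() < name.size()) continue;

        std::string prefix = line.substr(0, name.size());
        std::transform(prefix.begin(), prefix.end(), prefix.begin(),
                       [](unsigned char c) {
                           return static_cast<char>(std::tolower(c));
                       });
        if (prefix != name) continue;

        const std::size_t first = line.find_first_not_of(" \t", name.size());
        if (first == std::string::npos) return "";
        const std::size_t last = line.find_last_not_of(" \t");
        return line.substr(first, last - first + 1);
    }
    return "";
}

// Hypothetical in-memory record of the mails indexed so far.
class IndexedMessages {
public:
    // Returns true if the mail is new and should be indexed now.
    bool markIfNew(const std::string& rawMessage) {
        const std::string id = messageId(rawMessage);
        if (id.empty()) return true;  // no ID: cannot deduplicate, index it
        return seen_.insert(id).second;
    }

private:
    std::unordered_set<std::string> seen_;
};
```

With a check like this, re-reading a grown mbox file only costs a header scan for the mails that were already indexed.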

Current status

As this project has not (yet) been approved by the Nepomuk developers, all the code still lives in the steckdenis-webindexers branch of my nepomuk-core git repository. The code is still minimal, but the e-mail indexer is already there. It can index mbox files and plain e-mails stored in maildir directories. I plan to implement a vCard indexer shortly.

As this project is at a very early stage, I encourage you to comment on it and to give me your impressions. Do you think that this architecture is right? Is my use case a good one (the fact that most users nowadays use webmail)?
