Last week, I presented an idea about indexing messages from webmails in Nepomuk. In short, the idea is to implement a browser extension for Firefox, Chrome and Konqueror. This extension parses the DOM tree of every page the user visits that belongs to a webmail. When e-mails are found, they are extracted and stored in a temporary file. Nepomukfileindexer then kicks in and indexes these e-mails in the Nepomuk database.
I started this experiment because many users use webmails instead of mail client programs like KMail or Thunderbird. The information in these e-mails therefore cannot be indexed by Nepomuk. As there are many different webmails, the biggest part of the project is implementing parsing rules for each of them. The browser plugin itself is quite small.
Extracting Information From the DOM Tree
Last week, I implemented the file indexers that are responsible for indexing mbox files and MIME files into the Nepomuk database. These indexers are now in fairly good shape. This week, I started to implement the browser plugins. I wanted to start with Firefox, as it is the most used web browser, but my experience with JavaScript is quite limited. I therefore started with Konqueror; more on that later.
The browser plugin reacts to DOM tree changes. When the DOM tree of a page has changed, the URL of the page is used to determine which webmail the user is on. If rules exist for this webmail, they are applied.
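To make the URL lookup concrete, here is a minimal sketch in standard C++. It assumes that URL patterns such as "*fr*.mail.yahoo.com/neo/*" are plain globs in which * matches any run of characters; the actual matching semantics of the plugin may differ. The names globMatch, WebmailRule and findRule are made up for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true if `text` matches `pattern`. A '*' in the pattern stands for
// any (possibly empty) run of characters; everything else matches literally.
// Assumption: this glob semantics is a guess, not taken from the plugin.
bool globMatch(const std::string &pattern, const std::string &text)
{
    std::size_t p = 0, t = 0;
    std::size_t starPos = std::string::npos; // index of the last '*' seen
    std::size_t starText = 0;                // text index when it was seen

    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            starPos = p++;
            starText = t;
        } else if (p < pattern.size() && pattern[p] == text[t]) {
            ++p;
            ++t;
        } else if (starPos != std::string::npos) {
            // Mismatch: let the last '*' absorb one more character and retry.
            p = starPos + 1;
            t = ++starText;
        } else {
            return false;
        }
    }
    // Only trailing '*' may remain unmatched in the pattern.
    while (p < pattern.size() && pattern[p] == '*')
        ++p;
    return p == pattern.size();
}

// Hypothetical in-memory form of a rule: a label plus its URL patterns.
struct WebmailRule {
    std::string name;
    std::vector<std::string> urls;
};

// Returns the first rule with a pattern matching `url`, or nullptr.
const WebmailRule *findRule(const std::vector<WebmailRule> &rules,
                            const std::string &url)
{
    for (const WebmailRule &rule : rules)
        for (const std::string &pattern : rule.urls)
            if (globMatch(pattern, url))
                return &rule;
    return nullptr;
}
```

With this sketch, a page whose URL does not match any pattern yields nullptr, and the plugin would simply do nothing for it.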
A rule is a simple JSON file that describes how to extract information from the DOM tree. I use JSON because parsing it in C++ is easily done using QJSON, and parsing JSON in JavaScript is as easy as an eval call. The DOM is read using CSS selectors, that is to say selectors as used in CSS style sheets (div.test matches every <div> tag that has the class "test"). The first rule I developed is the one for Yahoo! Mail in French (Yahoo! Mail has the bad idea of displaying a strongly localized date-time, and the ISO-formatted date-time is found nowhere in the document):
{
    "urls": ["*fr*.mail.yahoo.com/neo/*"],
    "date_format": "ddd, d MMM. yyyy à h:mm",
    "emails": {
        "node": "div.message.content",
        "date": {
            "node": "div.info div.date"
        },
        "title": {
            "node": "div.info h3 span.txt"
        },
        ...
        "content": {
            "node": "div.msg-body"
        }
    }
}
Applying this JSON pattern against the DOM tree yields a tree of QVariant values. Each QVariant is either a list, a map or a string. The tree looks like this:
emails = [
    {"date": "Mar, 7 aoû. 2013 à 13:12",
     "title": "This is a mail",
     "content": "Hi\n\nThis is an email!\n\nCheers"
    }
]
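The list/map/string shape described above can be modelled without Qt. The following is only a sketch in standard C++17 that stands in for the QVariant tree; Value, List, Map, stringField and sampleTree are names invented for this example, and the assumption that the root is a map with an "emails" key is mine.

```cpp
#include <map>
#include <string>
#include <variant>
#include <vector>

struct Value;
using List = std::vector<Value>;
using Map = std::map<std::string, Value>;

// One node of the extracted tree: a string, a list of nodes, or a map of nodes.
struct Value {
    std::variant<std::string, List, Map> data;
};

// Returns the string stored under `key` if `node` is a map that holds a
// string there, or nullptr otherwise.
const std::string *stringField(const Value &node, const std::string &key)
{
    const Map *map = std::get_if<Map>(&node.data);
    if (!map)
        return nullptr;
    auto it = map->find(key);
    if (it == map->end())
        return nullptr;
    return std::get_if<std::string>(&it->second.data);
}

// Builds a tree mirroring the sample result shown above.
Value sampleTree()
{
    Map email;
    email["date"] = Value{std::string("Mar, 7 aoû. 2013 à 13:12")};
    email["title"] = Value{std::string("This is a mail")};
    email["content"] = Value{std::string("Hi\n\nThis is an email!\n\nCheers")};

    Map root;
    root["emails"] = Value{List{Value{email}}};
    return Value{root};
}
```

Walking such a tree is then a matter of checking which alternative each node holds, which is roughly what one does with QVariant::type() on the Qt side.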
This data is stored in a MIME file using the wonderful KMime::Message class. The MIME file is stored in /tmp/kde-username/.
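For readers who have never looked inside a MIME file, here is a rough sketch of what ends up on disk. It does not use KMime at all: it builds a minimal text/plain message by hand in standard C++, under the assumptions that the date has already been converted to its RFC 5322 form and that the headers are plain ASCII (non-ASCII headers would need extra encoding, which KMime takes care of). ExtractedEmail and toMimeText are hypothetical names.

```cpp
#include <string>

// Hypothetical flat view of one extracted e-mail.
struct ExtractedEmail {
    std::string from;    // e.g. "Alice <alice@example.org>"
    std::string subject; // assumed ASCII here
    std::string date;    // assumed to be already in RFC 5322 form
    std::string body;    // UTF-8 text with '\n' line breaks
};

// Serializes the e-mail as a minimal single-part MIME message.
std::string toMimeText(const ExtractedEmail &email)
{
    const std::string crlf = "\r\n";
    std::string out;
    out += "From: " + email.from + crlf;
    out += "Subject: " + email.subject + crlf;
    out += "Date: " + email.date + crlf;
    out += "MIME-Version: 1.0" + crlf;
    out += "Content-Type: text/plain; charset=UTF-8" + crlf;
    out += "Content-Transfer-Encoding: 8bit" + crlf;
    out += crlf; // an empty line separates the headers from the body

    // Normalize line breaks in the body to CRLF.
    for (char c : email.body) {
        if (c == '\r')
            continue;
        if (c == '\n')
            out += crlf;
        else
            out += c;
    }
    return out;
}
```

The real plugin leaves all of this to KMime, which also handles header encoding and multipart messages.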
The Konqueror Plugin
- The source code is available in my nepomukintegration scratch repository.
The Konqueror plugin is currently only 300 lines long. QJSON does all the JSON parsing, and KMime::Message writes the MIME files. What is left to do in C++ is matching the patterns against the DOM tree. Here, Konqueror acts as two web browsers: you can choose to display web pages using either KHTML or kdewebkit. The two engines do not share the same structure. KHTML is based on KDE-specific DOM classes, and kdewebkit uses a QWebView.
Matching a pattern therefore needs to be done for the two engines. Instead of duplicating all the code, I used a big template function. When the APIs of the two engines are different, I implement utility functions with an overload for each engine:
// KHTML: DOM::Element exposes its text as a DOM string.
static QString elementContent(const DOM::Element &element)
{
    return element.textContent().string();
}

// kdewebkit: QWebElement returns the text directly as a QString.
static QString elementContent(const QWebElement &element)
{
    return element.toPlainText();
}
The template function only uses the utility functions, and the compiler chooses the right overloads.
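The pattern is easier to see in a self-contained form. The sketch below replaces the two engine classes with made-up stand-ins (KhtmlLikeElement and WebKitLikeElement) so that it compiles without KDE or Qt; collectContents is likewise a hypothetical example of a template function, not the one from the plugin.

```cpp
#include <string>
#include <vector>

// Stand-ins for the two engine element types. They are hypothetical and
// only mimic the fact that the two APIs name things differently.
struct KhtmlLikeElement {
    std::string text;
    std::string textContent() const { return text; }
};

struct WebKitLikeElement {
    std::string text;
    std::string toPlainText() const { return text; }
};

// One overload per engine hides the API difference.
static std::string elementContent(const KhtmlLikeElement &element)
{
    return element.textContent();
}

static std::string elementContent(const WebKitLikeElement &element)
{
    return element.toPlainText();
}

// Engine-independent code: it only calls elementContent(), and overload
// resolution picks the right version for each Element type.
template <typename Element>
std::vector<std::string> collectContents(const std::vector<Element> &elements)
{
    std::vector<std::string> contents;
    contents.reserve(elements.size());
    for (const Element &element : elements)
        contents.push_back(elementContent(element));
    return contents;
}
```

Calling collectContents with a vector of either element type instantiates the template once per engine, so the matching logic is written only once.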
Status
The Konqueror plugin is now quite usable. It works, it does not crash, and my e-mails are found. In fact, my e-mails get properly extracted to temporary files, and running nepomukindexer correctly indexes the data in my Nepomuk database. The plugin can therefore be considered functional, even if it is still at an early stage.
Indexing mails is one thing, contacts are another. "Simple" contacts are already indexed: if the sender of an e-mail (identified by their name and e-mail address) is not already present in the Nepomuk database, a new contact is added. By parsing contact-specific pages of webmails, it should be possible to extract more information about the contacts, like their birthday, gender, etc. This information will be stored in vCard files, indexed by the vCard file indexer I developed last Sunday. Another kind of data I would like to index is the browsing history.
You may be wondering why I'm storing the extracted information in files instead of storing it directly in Nepomuk. After all, Konqueror is a KDE application and has access to Nepomuk. The reason is that I don't want to duplicate work. Nepomukindexer already handles the cleanup of existing metadata, and my e-mail indexer ensures that an e-mail never gets indexed more than once. Moreover, the extension needs to be reasonably cross-browser compatible, and Firefox and Chrome have no access to Nepomuk (Nepomuk requires a Qt event loop, and integrating the Qt event loop in GTK-based applications is tricky; see what the GTK Qt style has to do).