Porting the Nepomuk Query Parser to Baloo

Published on Mon 10 March 2014 in Nepomuk

As explained in my last blog post, Nepomuk has been replaced by Baloo, a new and simpler desktop search engine built on the experience gained with Nepomuk. One consequence of this new project was that my query parser, which I developed during GSoC 2013, was left behind in the Nepomuk world.

Several days ago, after university assignments and exams, I was finally able to find some time to dedicate to KDE. I used it to port the Nepomuk query parser to Baloo.

Baloo: Simple yet Powerful

The first thing I needed to do before porting the query parser was to understand at least parts of the Baloo code base. I needed to find out what was available, what was still compatible with Nepomuk-based code (Baloo keeps all the best ideas of Nepomuk), and where to put the parser.

I was quickly able to find my way around Baloo, as its code base is much smaller than Nepomuk's. The infrastructure is also simpler. The Nepomuk one was very complex: I remember having to wander through different levels of abstraction on the client side, then on the server, and finally in Soprano, just to find where a value was being changed in a way that I did not like. In Baloo, everything is clean and simple.

For instance, in Nepomuk, we had a Term class that represented "something" (a date-time, a comparison, a keyword, and also AND, OR and NOT operations). This class had many subclasses, one for each type of term. There was a LiteralTerm for simple values (a word, a number), a ComparisonTerm for comparisons (comparing a property with a literal term), and terms for logical operations. OrTerm and AndTerm both represented a logical operation, but were different subclasses with no common ancestor (except Term). My parser had to duplicate code in order to handle AND and OR operations, even though OrTerm and AndTerm behaved exactly the same way.

In Baloo, there is only one Term class that does everything. It may look like poor engineering, but it is actually very efficient and clean. A term is made of a property, a comparison operator, a value, a logical operator and subterms. AND and OR terms are simply terms whose logical operator is set to And or Or. There is no code duplication: I can simply use the subterms without having to worry about the logical operator. LiteralTerm is also gone and the API is far cleaner: gone are the Soprano::LiteralValues and complex property URIs. Properties are now simple strings, and values are QVariants.
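To make the "one class for everything" idea concrete, here is a minimal sketch of such a term. It is not the real Baloo class: the field names, the two factory functions and the use of std::variant in place of QVariant (so that the sketch compiles without Qt) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for a Baloo-like Term; NOT the real Baloo API.
// std::variant replaces QVariant so that the sketch does not depend on Qt.
struct Term {
    enum class Comparator { Equal, Contains, Greater, Less };
    enum class Operation { None, And, Or };
    using Value = std::variant<std::monostate, long long, std::string>;

    std::string property;                    // a plain string, no URI
    Comparator comparator = Comparator::Equal;
    Value value;                             // empty for purely logical terms
    Operation operation = Operation::None;   // And/Or for logical terms
    std::vector<Term> subterms;              // operands of a logical term

    // Build a leaf term such as "size > 2048".
    static Term comparison(std::string property, Value value, Comparator cmp) {
        Term t;
        t.property = std::move(property);
        t.value = std::move(value);
        t.comparator = cmp;
        return t;
    }

    // Build an AND or OR term; both go through exactly the same code path.
    static Term group(Operation op, std::vector<Term> children) {
        assert(op != Operation::None);
        Term t;
        t.operation = op;
        t.subterms = std::move(children);
        return t;
    }
};

// Convenience operator mirroring the "Term && Term" style used in the unit test.
inline Term operator&&(Term lhs, Term rhs) {
    return Term::group(Term::Operation::And, {std::move(lhs), std::move(rhs)});
}
```

With this shape, an OR term would differ from an AND term only by the value of `operation`, which is why code walking the tree does not need one branch per subclass.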

Porting by Removing Code

The porting process began by copying the entire query parser into a new directory in Baloo Core. At first, I did not want to drop the entire history of the parser (Git has nice tools to merge code from somewhere into a new repository without losing any commit), but I finally decided that the history would only be noise, and that the parser would never compile anyway before the port was complete. I also sed'ed the files in order to change every occurrence of Nepomuk to Baloo (in the comments and the license headers).

I started with the glue between the parser and Baloo. In Nepomuk, I needed a small utils.cpp file that performed the repetitive tasks of wrapping and unwrapping terms between the Nepomuk world and that of the parser. It contained several one-line and two-line functions. I was very happy to remove most of these functions when porting to Baloo, as Baloo is clean and does not require repetitive boilerplate to be used.

Once the utility functions were ported, I attacked the parsing passes. Each parsing pass is quite small and independent of the others. I was very surprised to see that the porting mostly consisted of removing code. I no longer need to pass LiteralTerms around and to build comparisons one small step at a time (first the operator, then the property, and only then the value). A pass that took 100 lines in Nepomuk is now reduced to approximately 60 lines.

Finally, I had to port the parser itself. I expected that to be a big task, as the parser consists of the pattern matcher (doing the proper recognition of "sent to X") and everything needed to parse date-times. In the end, it took only one day, and I was able to remove many lines of code. No big feature was lost and no big refactoring was done, as Baloo is still quite close to Nepomuk, but I was able to remove many duplicated lines and a lot of boilerplate. For instance, there were 40 lines just to recurse into the subterms of terms (using a different method for inversions, AND and OR). Now, a small 6-line loop can simply iterate over Term::subterms and handle all cases (inversions, AND, OR or anything that could be invented in the future). The query parser weighed 2,132 SLOC in Nepomuk and now consists of 1,892 SLOC.
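The recursion over subterms mentioned above can be sketched as follows. This is only an illustration under assumptions: the Term struct is a simplified, hypothetical stand-in (the real parser works on Baloo's own Term class), and collectProperties is an invented example of a traversal, not a function of the parser.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a Baloo-like term (not the real class).
struct Term {
    enum class Operation { None, And, Or };

    std::string property;                    // empty for purely logical terms
    Operation operation = Operation::None;
    bool negated = false;                    // an "inversion" is only a flag here
    std::vector<Term> subterms;
};

// Collect every property referenced anywhere in a term tree, depth-first.
void collectProperties(const Term &term, std::vector<std::string> &out)
{
    if (!term.property.empty()) {
        out.push_back(term.property);
    }

    // One loop for every kind of term: AND, OR and negated terms all
    // expose their operands through the same subterms list.
    for (const Term &sub : term.subterms) {
        collectProperties(sub, out);
    }
}
```

Because the loop never looks at `operation` or `negated`, a new kind of logical term would be traversed without touching this function, which is the property the paragraph above relies on.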

Trying the Query Parser

The query parser, ported to Baloo, lives in kde:baloo, branch queryparser. You can use it in your applications by including baloo/queryparser.h. The API is very simple and the parser does all the work. You can have a look at the unit tests in src/core/autotests/queryparsertest.cpp. Here is one of the unit tests, to show how simple the parser and Baloo are to use:

void QueryParserTest::testReduction()
{
    QCOMPARE(
        QueryParser::parseQuery("size > 2K and size < 3K"),
        Query(
            Term("size", 2048, Term::Greater) &&
            Term("size", 3072, Term::Less)
        )
    );
}

The parser will not be released in KDE SC 4.13, but may be finished in time for KDE SC 4.14. I also still have to port the query builder widget, but it should not be too difficult as the widget is simple and depends on nothing except the query parser and Term.

GSoC 2014

Before ending this blog post, I would like to say that I will apply for this year's Google Summer of Code. Working with KDE developers is a great experience and I would be very happy to be able to participate again this year. I have several project ideas, but if you think of something involving a parser, artificial intelligence, or low-level stuff (optimizing something, or stuff that generally gives headaches), I would be glad to hear about it. Here are the ideas (found on the KDE Ideas page or "invented") that currently interest me the most:

  • Working on Baloo, because I already know the library (and I like it). There are many small things to do everywhere (new file indexers, adding features, implementing more fine-grained date filters, etc.), but I don't know if there is a big three-month project that could be done. I'm thinking about it :) . If you dream of something exciting that could be done using desktop search, I'm all ears.
  • Giving love to the KDevelop QML/JS language plugin. The plugin is already in fairly good shape, but its maintainer has ideas for many improvements that could span an entire summer. What could be very interesting is parsing the QML files shipped with Qt (and Plasma and any other QML "library") in order to offer auto-completion of the items they expose. An item gallery may also be quite useful.
  • Adding the ability to preview QML files in KDevelop. There is already such a feature in Qt Creator, but it seems that few people like it. I need to investigate and see whether a user-friendly previewer is possible. Plasmate, for instance, has a Plasmoid viewer that allows for easy testing of plasmoids.
