Writing a new Nepomuk Query Parser during the Summer

Published on mar 28 mai 2013 in Nepomuk, (Comments)

Hello everyone !

I'm Denis Steckelmacher, a Belgian student in the end of its second year of Computer Science at the Free Univerisity of Brussels. Yesterday, I had the pleasure to learn that my Google Summer of Code proposal, "A new Query Parser and Auto-Completed Input Field for Nepomuk", was accepted by KDE.

Project summary

Nepomuk is the software that indexes files, mails, contacts, and everything you ask it to index on your computer. All these resources are stored somewhere for future use, and are linked between them. The main goal of Nepomuk is not only to store resources, but also to find them. When I type something in KRunner and that files containing this or these words appear in the list, it is Nepomuk in action. When I'm looking for a mail, I think it is also Nepomuk that is used (in recent versions of KMail).

The goal of my project is to make this search experience more easy and enjoyable for us, humans. Currently, "queries" entered in a search field are parsed by a good enough parser that can understand words, tags, and some other things. My project is about rewriting this parser to be more extensible, and I will strive to make it the most human-friendly as possible.

For instance, it will be my first parser that will have an ambiguous grammar, because humans are ambiguous. For instance, if I type "two days ago" in a search field, I may want to find documents containing "two days ago" (I sent a mail to my sister and I only remember that this sentence was in it), or any of these three words, or any document created or edited two days ago. Even filtering documents by date is ambiguous, as I may want to find documents that were exactly touched two days ago, or I may also want to find the ones touched yesterday and this morning.

Add to this the possibility to filter by properties, for instance "size<=4M" or "hasTag:holidays", and you can get something very complex. Think about what Google allows you to write in its search bar ("my word site:mysite.com" for instance).

Action plan

The first thing I need to do is to discuss with the Nepomuk developers and people in charge of the localization. We need to define a precise yet human-friendly (and more importantly, implementable) grammar that will be understood by the parser. I hope to be able to put most of the features I want in the parser, but it may be very difficult to make it understand correctly what users without any technological knowledge will want to search for.

The time-line for this project is described near the end of my proposal. To summary it a bit, the aforementioned discussion will take place from now to the middle of June (and will continue if needed, but I hope to already have some feedback by the middle of June). Then, I will implement the parsing infrastructure. I need to design an Abstract Syntax Tree (a computer-friendly and unambiguous way of representing information) and the beginning of the parser.

During July, I will begin the second half of my proposal : a line edit that provides syntax highlighting and auto-completion. This way, users will be helped and will see how their query is understood by the system. The auto-completion will show them what is possible (I like to empower the users). For instance, after having typed "date:", the user will see "enter a date" that will open a calendar if clicked on, and "describe a date" with "last week" shown as example.

The syntax highlighting will also be a great debugging tool, as I will be able to see how the parser understands or misunderstands what I enter. Some features will be a bit tricky, for instance the optionality of quotes and nearly everything (I want "school date:two days ago" to be parsable, and my dream would be that "school, two days ago" or "school date:two days ago size < 4M" are also understood).

Any progress will be described on my blog that I hope will be registered on Planet KDE. Comments are welcome for any blog entry, and I they don't work (I use Disqus that I have never used before), don't hesitate to send me a mail (my mail account at yahoo.fr is steckdenis).

Current status

As my proposal was just accepted, I did not already have time to do very fancy things. The only code I developed before my proposal was accepted, because the feasibility of my proposal depends on it, is an human date-time parser.

This class is responsible for translating "Monday of last week" into 2013-05-20 (if the current date is 2013-05-28). It does that by matching locale-specific rules on the input string. Each rule is accompanied by actions to be performed on the date-time, that is at the beginning of the parsing set to the current date and time. The order of the rules is important as they have to be matched in the right order to be meaningful. For instance, "last week" will first be matched and will remove 7 days from the current date-time. Then, "Monday" is matched and will set the current day-of-week to Monday.

This code is as locale-independent as possible. There is no string matching in the code, no dependence on a specific character set, and everything, month names, numbers, etc, is read from the parsing rules. The advantage is that languages I really don't know will still be able to use this parser to parse dates, even if they are in another calendar system. The inconvenient is that parsing rules are complex, currently written in a big XML file with special syntax for placeholders (example below), and using regular expressions. The KDE translators seemed ok with that, but I don't want the parsing rules to be so complex that it takes months to have a nice set of parsing rules for every language.

<!-- Periods in this language (simple translations with metadata if needed) -->
<period type="year" value="10">decades?</period>
<period type="year">years?</period>
<period type="month" value="3">seasons?</period>
<period type="month">months?</period>

<values name="month">
    <value value="1">january</value>
    <value value="2">february</value>
    <!-- ... -->
<values name="number">
    <value value="1">one</value>
    <value value="1">a</value>
    <value value="1">first</value>
    <value value="2">two</value>
    <!-- ... -->

<!-- Parsing rules, here the documentation becomes necessary -->
<rule pattern="$number$ %period% ago">
    <sub value="$1" />  <!-- Remove $1 (the number) from the given period -->
<rule pattern="in $number$ %period%">
    <add value="$1" />  <!-- Add the number to the given period -->
<rule pattern="next %period%">
    <add value="1"/>    <!-- Add 1 to the given period -->

The complete file is en_US.xml in the Github repository of this project.

« New blog   Finding Information in Human-Entered Search Queries »