Articles in the Nepomuk category

A Nicer Query Builder Widget

Published on Tuesday 16 July 2013 in Nepomuk

After 10 days of vacation, I'm now back at work for the rest of the GSoC period. Before my departure, I presented a syntax-highlighted query builder widget. It was based on a QTextEdit made to look like a QLineEdit, and a QSyntaxHighlighter subclass was responsible for the highlighting.

The result was quite nice, but not as nice as what Ivan Čukić imagined for this control. From the first moment I saw his mockup, I wanted to have a widget like that in Nepomuk. The problem is that such a widget is very difficult to implement (and even more difficult to implement correctly). I haven't found any existing code on Google, and the only application I know of that uses such a widget is Yahoo! Mail. I therefore decided to implement this widget myself.

General Idea

The code of the widget lives in my branch of the nepomuk-widgets repository. Even though I tried to keep the widget general (the GroupedLineEdit class does not reference any Nepomuk class), I don't think it can already be useful to other projects. Don't hesitate to prove me wrong, though. If this widget one day becomes general enough and saner than it currently is, I would like to have it merged into Qt (the widget doesn't use any KDE class).

The widget

The idea of the widget is to "group" terms into blocks. A block is a rounded rectangle with a small cross, and each block has a different color. When the user clicks the cross, the group is deleted. The blocks must be purely cosmetic for the user. That means that the full query builder still needs to behave like a QLineEdit: the user must be able to move the cursor using the arrow keys of the keyboard, and the cursor must not get stuck at one end of a block. It must be able to flow from a block to the next or the previous one. The user must also be able to add text anywhere, even between blocks.

Blocks are added to the widget by the application, one at a time. Blocks cannot be removed individually, but the widget can be cleared (every block is removed, the text being preserved). When blocks need to change, the application thus clears all the blocks, then re-adds the ones it needs. This is not the most efficient operation, but doing otherwise would have greatly complicated the API and the code itself.

Flowing From One Line Edit to Another

"Flowing" is an operation needed when there are two line edits next to each other. When the user presses the arrow keys, the cursor moves within one of the line edits. What I want is to detect when the user tries to move left/right while the cursor is already at the left/right end of a line edit. When that occurs, the cursor is placed at the right/left end of the previous/next line edit. [ ][| ], with the vertical bar representing the cursor, becomes [ |][ ] when the user presses the left key.

Unfortunately, QLineEdit does not allow that. When the cursor moves, the cursorPositionChanged signal is emitted. This signal contains information about the old and the new position of the cursor. The problem is that this signal is only fired when the cursor actually moves. If the user presses Left when the cursor is already at the left end of the widget, the signal is not emitted.

I looked at the code of QLineEdit and QLineControl, and there was nothing hidden there to help me. The solution I used is to subclass QLineEdit and to reimplement its keyPressEvent method (I still need to find out how to make all that work with right-to-left languages):

void GroupedLineEditEdit::keyPressEvent(QKeyEvent *e)
{
    if (e->key() == Qt::Key_Left && cursorPosition() == 0) {
        emit cursorBeforeStart();
    } else if (e->key() == Qt::Key_Right && cursorPosition() == text().length()) {
        emit cursorAfterEnd();
    }

    QLineEdit::keyPressEvent(e);
}
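As a rough illustration of what the receiver of these two signals has to do, here is a minimal sketch in plain C++ (no Qt). EditRow, flowLeft and flowRight are hypothetical names that are not taken from the widget, and the row of line edits is reduced to a list of strings plus a cursor.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of a row of line edits: only texts and cursor are tracked.
struct EditRow {
    std::vector<std::string> texts; // content of each line edit, left to right
    std::size_t current = 0;        // index of the edit that holds the cursor
    std::size_t cursor = 0;         // cursor position inside that edit

    // Reaction to cursorBeforeStart(): jump to the end of the previous edit.
    // Returns false when the cursor cannot flow (it then stays where it is).
    bool flowLeft() {
        assert(current < texts.size());
        if (cursor != 0 || current == 0) return false;
        --current;
        cursor = texts[current].size();
        return true;
    }

    // Reaction to cursorAfterEnd(): jump to the start of the next edit.
    bool flowRight() {
        assert(current < texts.size());
        if (cursor != texts[current].size() || current + 1 >= texts.size()) return false;
        ++current;
        cursor = 0;
        return true;
    }
};
```

In the real widget, the two methods would presumably be slots that also move the keyboard focus to the neighbouring line edit; that part is left out here.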

What Next?

I invite you to test my code. This can be done by cloning the gsoc2013 branch of http://public.steckdenis.be/git/nepomuk-core and http://public.steckdenis.be/git/nepomuk-widgets. Then, build and install nepomuk-core, and build nepomuk-widgets. When everything is built, launch your_build_dir/test/querybuilderapp. A small window like the one shown at the beginning of this post pops up. Enter a query, and see it being highlighted in real-time.

If you have a slow computer, don't hesitate to test my code on it. Currently, the highlighting is done every time you type a character. If your CPU usage becomes too high when you test the widget, one solution would be to highlight the query not at each character typed, but every half second or so.
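The "every half second or so" idea could look like the following sketch. It is plain C++ with the time passed in explicitly so that it can be tried without Qt. HighlightThrottle is a hypothetical helper, not something that exists in the repository, and inside a Qt widget a single-shot QTimer restarted on each key press would be the more usual route.

```cpp
#include <cassert>

// Hypothetical helper: decides whether a re-highlight is allowed right now.
class HighlightThrottle {
public:
    explicit HighlightThrottle(long intervalMs) : m_intervalMs(intervalMs) {
        assert(intervalMs > 0);
    }

    // Call on every key press with the current time in milliseconds.
    // Returns true when enough time has passed since the last highlight.
    bool shouldHighlight(long nowMs) {
        if (m_hasRun && nowMs - m_lastRunMs < m_intervalMs) {
            m_dirty = true; // remember that the text changed in the meantime
            return false;
        }
        m_hasRun = true;
        m_lastRunMs = nowMs;
        m_dirty = false;
        return true;
    }

    // True when a key press was skipped and a final highlight is still owed.
    bool hasPendingChange() const { return m_dirty; }

private:
    long m_intervalMs;
    long m_lastRunMs = 0;
    bool m_hasRun = false;
    bool m_dirty = false;
};
```

With an interval of 500 ms, a burst of key presses triggers one highlight at the start, and hasPendingChange() tells the caller that one more pass is needed once the user stops typing.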

My next step now is to investigate the auto-completion stuff. Ivan also proposed a nice interface, and I have ideas on how to implement that. I hope to have something to show by the end of this week.

Syntax-Highlighting Experiments

Published on Monday 1 July 2013 in Nepomuk

Sometimes, I'm very happy to work with Qt, because it provides everything I need, and in a very clever way. Syntax-highlighting is one of the many areas where Qt shines.

High-level syntax-highlighting of programming languages and configuration files is easily done using KTextEditor (the component used by Kate, KWrite, Kile and KDevelop, I think). One simply has to write an XML file describing the language to highlight, and the editor does the rest. Code-folding is also possible, as is every other feature everyone loves in Kate and KDevelop.

For my Google Summer of Code project, such a full-fledged editor is a bit overkill. What I need is a means to syntax-highlight a one-line text edit. The highlighting itself is simple, because the "grammar" (if it really is one) of the parser is simple.

A One-Line Text Edit

Yesterday, I started by implementing a small editor based on QPlainTextEdit, mimicking QLineEdit. I couldn't use QLineEdit directly because it doesn't allow syntax-highlighting: a QSyntaxHighlighter works on a QTextDocument, which a text edit widget provides and a QLineEdit does not. I therefore tried to make QPlainTextEdit look like a simple line edit. This can be done by configuring it properly (setting the size policy, disabling tabs, the scroll bars and the word-wrap, etc.) and by reimplementing its sizeHint() method. The reimplemented method behaves like the one of QLineEdit (original code from the Qt project FAQ):

QSize QueryBuilder::sizeHint() const
{
    QFontMetrics fm(font());
    QStyleOptionFrameV3 opt;
    QString text = document()->toPlainText();

    int h = qMax(fm.height(), 14) + 4;
    int w = fm.width(text) + 4;

    opt.initFrom(this);

    return style()->sizeFromContents(
        QStyle::CT_LineEdit,
        &opt,
        QSize(w, h).expandedTo(QApplication::globalStrut()),
        this
    );
}

The method calculates the area of the displayed text (using QFontMetrics), and then asks the style to compute the size of the whole widget. By telling the style that the widget is a CT_LineEdit, the size returned is the exact same size a line edit would have. The illusion is perfect:

[Image: query builder widget]

Choosing Colors

After having implemented the text editor, I started experimenting with the syntax-highlighting itself. My idea was to have every ComparisonTerm displayed in a different color, with the literal terms displayed in bold. During the development, I also thought that it would be nice to have "type hints" (mails, photos, documents, files, contacts, etc) displayed in italics.

My first idea was to follow the interface given by Ivan Čukić here. The problem was that I don't even know what his "single-line edit with grouped terms that can be removed by clicking on a cross" is officially called. So, I found nothing on Google that could help me implement it. I tried different solutions, but there were always problems (for instance, a QLineEdit draws its white background and its text in one step, so I cannot draw boxes between the two). In order to still be able to experiment, I decided to use a simpler approach (syntax-highlighting) for now.

In the above image, you can see that every comparison term is highlighted in a different color. The colors come from the Oxygen palette, and all have roughly the same luminosity. It is very difficult to find colors that look good together, so if you have ideas, don't hesitate to tell me. I have selected 8 colors, and the highlighter cycles between them.

I also tried to use HSV colors (Hue, Saturation, Value). Saturation and Value were forced to values often found in Oxygen colors, and the hue was incremented by 71 degrees every time the color had to change. The result was OK, but not so nice. Sure, there were up to 360 different colors, but they were mostly bad-looking and difficult to read. I abandoned the idea and decided to use Oxygen colors, which were carefully crafted by the great designers of KDE.
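The "up to 360 different colors" claim follows from 71 and 360 having no common divisor: the hue only repeats after 360 steps. A small sketch of the arithmetic, assuming the first hue is 0 degrees (the starting hue is not given above) and using hypothetical function names:

```cpp
#include <cassert>
#include <set>

// Hue (in degrees, 0..359) of the n-th color when stepping by `step` degrees.
int hueForIndex(int n, int step = 71) {
    assert(n >= 0 && step > 0);
    return (n * step) % 360;
}

// Number of distinct hues produced before the sequence starts repeating.
int distinctHues(int step = 71) {
    std::set<int> seen;
    for (int n = 0; n < 360; ++n) seen.insert(hueForIndex(n, step));
    return static_cast<int>(seen.size());
}
```

A step of 72 degrees, by contrast, would only ever produce 5 hues, which is why an odd-looking value such as 71 is a reasonable choice.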

I developed this experiment directly in the Nepomuk Widgets repository. As I have no KDE developer account yet, the repository can be found at http://public.steckdenis.be/git/nepomuk-widgets . My code is in the gsoc2013 branch. Feel free to experiment and to modify the colors (a table in ui/querysyntaxhighlighter.cpp).

UPDATE:

Most users will never or very rarely use comparison terms. All they want is a tool that returns a list of documents matching terms. The syntax-highlighter I presented in this blog post chooses a different color for each comparison or literal term. Queries like "cat dog rabbit" are therefore highlighted using 3 different colors, every term being in bold. This is not really nice.

So, I changed the highlighting a bit. Now, literal terms are not highlighted at all, only comparison terms and resource type terms (type hints) are. For comparisons, the whole comparison is colored, and its part that is not a literal term is also rendered in italics. So, "foo bar sent by Jimmy" is now displayed with "foo bar" left plain, the whole "sent by Jimmy" comparison colored, and "sent by" in italics:

[Image: don't highlight literal terms]

Another modification is that nested queries are detected and underlined, so users can see where they end, and potentially why their query does not return the expected results:

[Image: underlining of nested queries]

Merging the parser

Published on Friday 28 June 2013 in Nepomuk

As my query parser begins to be useful and to work well, I want to start working on the syntax-highlighted input field. As I do not want to multiply the experimental repositories (people need to find them, to build them, and when the work is upstreamed, I have to put a big notice saying that the repository is outdated), I plan to develop the input field in a separate branch of nepomuk-widgets.

The syntax-highlighted input field needs to use the query parser. If the query parser remains in a separate GitHub repository, I have to add it as an external dependency in nepomuk-widgets. Furthermore, I have been thinking for some days now that it would be interesting to see how the new parser integrates with Nepomuk as a whole.

Merging git subtrees

- My branch of nepomuk-core: gsoc2013 of http://public.steckdenis.be/git/nepomuk-core/. It will move to the official Nepomuk Core repository once I get a KDE developer account and the Nepomuk maintainers allow me to push the branch.

I love history. I think that big squashed merges are difficult to get reviewed because reviewers need to read and understand thousands of lines of code at once. When the development history is kept, it can be viewed progressively. The only problem is that development is never a straight line going towards the finished product. There are experiments, sometimes something gets reverted or canceled, or things simply change.

In the case of my parser, the grammar of the pattern matcher changed after 2 days of development, and the pattern matcher became a class of its own a bit later. There are also parsing passes that got removed or merged into other ones (hourminute, which was responsible for recognizing hours and minutes, is now merged into datevalues, which assigns values to hours, minutes, seconds, and also days, months, weeks, etc.).

I decided to add my parser into the Nepomuk Core source-tree, with its full history. I found some interesting instructions on Stack Overflow, and everything worked perfectly. Now, my branch of the nepomuk-core repository contains a new directory: libnepomukcore/queryparser/. All my files are there and build perfectly. I also removed the old query parser (it was just one file).

The new query parser is source- and binary-compatible with the old one. In fact, its .h header is the same as the one of the old parser. I simply copied it, and changed some parts of the documentation (the syntax is not the same, limitations are lifted, and other limitations are temporarily added).

Building libnepomukcore was simple and there was no problem. I fixed two compiler warnings exposed by the use of GCC (I usually compile with Clang) and optimization flags that uncover reads of uninitialized variables. When everything was built, I was able to launch nepomukquerytester, a tool that allows me to enter queries and to get the resulting SPARQL query that will be sent to the storage backend.

Some days ago, Vishesh Handa developed a cool new utility called nepomuksearch. It is a command-line utility (with nice colors) that takes a query on its command line and outputs the corresponding documents. Without any modification, it was able to use my new parser to show my files. Keywords, properties, dates, everything works. I was able to list the files I modified yesterday simply by searching for "modified yesterday". Nice.

Now that the parser is merged into Nepomuk, I can build a libnepomukcore library that contains everything the syntax-highlighted input field needs.

Positional information

Another new feature of the past days is that the parser stores positional information into the terms it produces. A term is the data-structure used to represent a parsed query. There are literal terms (integers, strings, date-times), comparison terms, resource type terms ("mails" gives the hint that e-mails must be returned by the query), and logical terms (and, or, not), among others.

I have submitted a review request to the Nepomuk developers, asking them if I can add two fields to the base Term class. One of those fields is a position, the other is a length. They allow the parser to remember from which portion of the user query string it parsed terms.

Every term has positional information, not only literal ones. For instance, a comparison term covers the part of the user query that describes the comparison. For "mails sent by John", the comparison term covers "sent by John", the resource type term is linked to "mails", and the literal term simply is "John".
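To make the coverage concrete, here is a small sketch in plain C++. TermSpan and coveredText are hypothetical names standing in for the two proposed fields; the actual Term class is not reproduced here, and positions are assumed to be zero-based character indexes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the two fields proposed for the Term class.
struct TermSpan {
    std::size_t position; // index of the first character covered by the term
    std::size_t length;   // number of characters covered
};

// Returns the part of the user query that a term was parsed from.
std::string coveredText(const std::string &query, const TermSpan &span) {
    assert(span.position + span.length <= query.size());
    return query.substr(span.position, span.length);
}
```

For the query "mails sent by John", the resource type term would be {0, 5}, the comparison term {6, 12} and the literal term {14, 4}. A highlighter only needs these pairs to know which characters to format.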

With this tree of terms, each having positional information, it is possible to do a pretty neat syntax highlighting: use a different nice color for each term, and render literal terms in bold face. This way, the actual content (the literal terms) is clearly seen by the user, and he or she also sees if properties are correctly recognized. An example of a highlighted query is "mails sent by John" (I used an italic font instead of colors).

Syntax-highlighting is not the only way of representing this kind of information. For instance, Ivan Čukić proposed a great user interface that you can see here. I really find it beautiful and congratulate Ivan for it! I will see how it could be implemented with my parser (which recognizes complex patterns, even mixed patterns where date-times are spread between property comparisons).

Timing

We are now at the end of the second week of the Google Summer of Code. My project is composed of two big parts:

- The query parser (in the sense of "something that takes a query string and produces a Nepomuk2::Query::Query object")
- A "query builder widget", a widget used by Dolphin when you click on "Search" and that allows the user to build queries without having to understand how the parser works. The query parser recognizes a language close to natural language, but this widget can also help the user by exposing auto-completion suggestions (tag names, contacts, file-names, date-time popups, etc.), and decorating the input with some nice colors.

The query parser is in good shape and is able to parse queries. When porting the unit tests from the old parser to the new one, I found that some features are still lacking (like the recognition of file-name wild-cards), but it is already quite usable, and produces positional information.

The query builder widget will be very interesting to develop. The challenges will be graphical (the interface of Ivan is very nice but seems difficult to implement, and even more to implement right) and algorithmic, as the parser will need to be adapted (for instance, if a pattern like "sent by %1" only partially matches, it can be used to add a new auto-completion proposal).

I'm on holidays from July 3 to July 13, so there will be no new blog post during this period. Before that, I will try to advance as much as possible in the implementation of a nice query builder widget. I'm very excited, as nice-looking things have always pleased me. After my holidays, I will have nearly two and a half months to finish the query builder widget, fix bugs, and integrate Nepomuk at interesting places (KRunner, Plasma Active, etc.). I'm open to any suggestion.

Parsing Human Date-Times

Published on Saturday 22 June 2013 in Nepomuk

These last days, there wasn't much activity on my blog or on the GitHub repository that hosts my work for the summer. That doesn't mean that I did nothing: I was busy implementing date-times and wanted to be sure the direction was right before pushing commits.

Human Date-Times

What I call a "human date-time" is a text describing a date-time, written in human language. This has nothing to do with nice ISO-formatted dates or even dd/mm/YY dates that are written on the calendar. These dates are like "yesterday", "last month", "the first Monday of 2012", that is to say dates that humans use when they are talking to somebody else.

I already tried to parse these dates, and some code can be found in this repository. This code is based on XML files that contain per-locale parsing rules. For instance, "next $period$" makes it possible to parse and recognize "next month", "next year", "next week", etc.

Having the parsing rules in an XML file made this parser difficult to translate. The translators have to copy/paste the English rules and then translate every text they see. They also need to figure out how to add rules specific to their language (or calendar system), and how to remove untranslatable ones without breaking too many things.

New Parsing Passes

For the Nepomuk query parser, I chose another approach. Instead of having per-locale rules, the parser has a set of C++ passes that handle parts of date-times. These passes are:

- A pass that sets a value to a period, where the period and possibly also the value are parsed from the query. For instance, "next month" sets +1 to "month", and "3 days ago" sets -3 to "days". Absolute values are also supported, like in "first week".
- A pass that sets values to periods, where periods are fixed and values are parsed from the query. This is "May 8" (the 8 is always a day, May is a month name), "2013-04-04" (formal date-times are also parsed), and times.
- A last pass that transforms month and day names into values. "Monday" becomes 1, "Tuesday" becomes 2, etc. The pass relies on KCalendarSystem, which starts the week on Monday.

These passes are fairly generic. For instance, the second one, that sets values to fixed periods, always recognizes placeholders numbered from %1 to %7, where %1 is a year and %7 is a second. One big rule matches anything that sets a period, from "6th of June" to "6:30 pm".

d->runPass(d->pass_datevalues, i18nc(
    "A year (%1), month (%2), day (%3), day of week (%4), hour (%5), "
        "minute (%6), second (%7)",
    "%3 of %2 %1;%3 (st|nd|rd|th) %2 %1;%3 (st|nd|rd|th) of %2 %1;"
    "%3 of %2;%3 (st|nd|rd|th) %2;%3 (st|nd|rd|th) of %2;%2 %3 (st|nd|rd|th);"
    "%1 - %2 - %3;%1 - %2;%3 / %2 / %1;%3 / %2;"
    "in %2 %1; in %1;, %1;"
    "%5 : %6 : %7;%5 : %6;%5 h;"
));

This is a bit difficult to read, but this way, with a simple i18nc call, every language can provide its own set of rules.

The pass that also sets the period cannot be used with only one rule. The pass must be configured. For instance, "last %1", "first %1" and "next %1" all match only one period, but they have different meanings. The meaning is configured before the pattern is matched:

d->pass_dateperiods.setKind(PassDatePeriods::VariablePeriod, PassDatePeriods::Offset, 1);
d->runPass(d->pass_dateperiods,
    i18nc("Adding 1 to a period of time", "next %1"));

The kind of a period is whether it is a day, year, month, etc. When it is VariablePeriod, the kind is parsed from the query (it is the %1 in the patterns). It is also possible to force the period to a specific value, as "yesterday" is a day, but nothing apart from the word itself says so:

d->pass_dateperiods.setKind(PassDatePeriods::Day, PassDatePeriods::Offset, -1);
d->runPass(d->pass_dateperiods,
    i18nc("One day ago", "yesterday"));

How Date-Times are Parsed

The parsing passes take terms that match a given pattern, and output replacement terms. For instance, the pass that recognizes file sizes replaces "3 KB" with "nfo:fileSize=3000".

For a date-time, it is a bit more difficult. Each pass only sees a part of the date-time (it would be way too complex to try to parse every complete date-time possible using only one pass). For instance, a pass recognizes "first Monday" and knows that the final date must be the first Monday of the enclosing period. If I add "of June", another pass recognizes the month name and sets the period to "month". Finally, when all the parsing is done, the two fragments are fused, producing a valid date-time literal.

Each date-time piece of information is represented as a ComparisonTerm whose property is something like date://dayofweek/offset, where offset can also be value if the value is absolute ("3rd week of May" vs "next week").

A query like "first Monday of June" is therefore parsed to "date://dayofweek/value=1 AND date://month/value=6". A special pass is run between all the parsing passes that recognize date-time fragments and property passes that may need valid date-time literals. This pass is responsible for fusing fragments. It works by grouping adjacent fragments together in a DateTimeSpec, a big structure that contains information about years, months, weeks, days, days of week, hours, minutes and seconds.

A DateTimeSpec is complete when its group ends, either because the query is finished or because the user put something else instead of a date fragment (so "mails sent yesterday related to files created last month" correctly produces two date-times). It is then transformed to a valid date-time.
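The grouping step can be pictured with the following sketch. It is a strong simplification: Token and groupDateFragments are hypothetical, each token is only marked as "date fragment" or "something else", and the DateTimeSpec structure itself is replaced by a plain list of fragment texts.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified term: either a date-time fragment or anything else.
struct Token {
    std::string text;
    bool isDateFragment;
};

// Groups runs of adjacent date fragments; each run would feed one DateTimeSpec.
std::vector<std::vector<std::string>> groupDateFragments(const std::vector<Token> &tokens) {
    std::vector<std::vector<std::string>> groups;
    bool inGroup = false;
    for (const Token &t : tokens) {
        if (t.isDateFragment) {
            if (!inGroup) groups.emplace_back(); // a new group starts here
            assert(!groups.empty());
            groups.back().push_back(t.text);
            inGroup = true;
        } else {
            inGroup = false; // anything else ends the current group
        }
    }
    return groups;
}
```

Fed with "mails sent yesterday related to files created last month", where "yesterday" and "last month" are the only fragments, the function returns two groups, hence two date-times.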

This last step is very difficult, as the parser has to handle relative date-times (easy, start from now and add years/months/whatever) and absolute ones. The absolute date-times are never totally absolute, as "June 6" is not June 6 of year 1, but June 6 of this year. Absolute and relative fragments can even be mixed, like in "June 6 of next year"¹ that gives an absolute month and day, and a relative year.

You can take a look at Parser::Private::buildDateTimeLiteral (in parser.cpp). This method builds a QDateTime object, and returns a literal term representing it. This term can now be used to build comparisons.

Date intervals

Even if the parser is able to build date-times, it is not able to handle intervals. For instance, when I write "mails sent in 2011", I want to have a list of all the mails sent between January 1, 2011 and December 31, 2011 included. The problem is that the parser is not yet able to produce this kind of interval. If I parse this query with the current parser, I get:

<and>
    <type uri="nmo#Email"/>
    <comparison property="nmo#sentDate"
                comparator="="
                inverted="false">
        <literal datatype="XMLSchema#dateTime">
            2011-01-01T14:32:03Z
        </literal>
    </comparison>
</and>

In fact, no mail was sent exactly at this date-time, so I will not get any result.

A possible solution is to replace date-time literals with comparisons (or, as there is no "Between" comparison operator, an AND term). I will try to do that this evening or tomorrow, as I want to start experimenting with syntax highlighting next week.
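A sketch of what such a replacement could compute for "mails sent in 2011": two bounds that would then be ANDed as "sentDate >= from" and "sentDate <= to". DateInterval, intervalForYear and contains are hypothetical names, and the choice of 23:59:59 as the last instant of the year is an assumption of this sketch, not something the parser does today.

```cpp
#include <cassert>
#include <string>

// Hypothetical result: the two bounds that would be ANDed together.
struct DateInterval {
    std::string from; // lower bound, inclusive
    std::string to;   // upper bound, inclusive
};

// Builds the interval covered by a whole year, as ISO 8601 date-time strings.
DateInterval intervalForYear(int year) {
    assert(year >= 1000 && year <= 9999); // keeps the strings at a fixed width
    const std::string y = std::to_string(year);
    return DateInterval{y + "-01-01T00:00:00Z", y + "-12-31T23:59:59Z"};
}

// ISO 8601 strings with an identical layout sort like the date-times they
// represent, so plain string comparison is enough here.
bool contains(const DateInterval &interval, const std::string &isoDateTime) {
    return interval.from <= isoDateTime && isoDateTime <= interval.to;
}
```

With this interval, a mail sent on 2011-01-01T14:32:03Z (or at any other time in 2011) matches, whereas the equality comparison shown above matches nothing.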

¹ In fact, the "of" in this query is currently not recognized by any pattern, and will remain in the parsed query. The two fragments will therefore be grouped into two different groups, thus producing two date-times. I still need to figure out how to avoid this.

Parsing Nested Queries

Published on Wednesday 19 June 2013 in Nepomuk

At the beginning of this week, I added two new features to my Nepomuk query parser.

The first one is logic operators. By default, terms or comparisons that are simply separated by spaces are ANDed. For instance, "holidays size > 2M" matches documents containing "holidays" and having a size greater than 2 megabytes. Supported operators are AND, OR and NOT (which can also be written "!"). Expressions can be enclosed in braces to impose a specific grouping. Every operator has the same priority, and they are left-associative: "a AND b OR c" is parsed to "(a AND b) OR c".
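The "same priority, left-associative" rule amounts to a simple left fold over the tokens. The sketch below illustrates it on strings rather than on real terms; foldLeft is a hypothetical function, NOT and explicit grouping are left out, and the first token is assumed to be a term.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Folds "term op term op term ..." from left to right, all operators having
// the same priority. Adjacent terms without an operator are ANDed.
std::string foldLeft(const std::vector<std::string> &tokens) {
    assert(!tokens.empty());
    auto isOperator = [](const std::string &t) { return t == "AND" || t == "OR"; };

    std::string tree = tokens[0];
    std::size_t i = 1;
    while (i < tokens.size()) {
        std::string op = "AND"; // default when terms are only separated by spaces
        if (isOperator(tokens[i])) {
            op = tokens[i];
            ++i;
        }
        if (i >= tokens.size()) break; // trailing operator: ignored in this sketch
        tree = "(" + tree + " " + op + " " + tokens[i] + ")";
        ++i;
    }
    return tree;
}
```

Each new term wraps everything parsed so far, which is exactly why "a AND b OR c" groups as "(a AND b) OR c" and not as "a AND (b OR c)".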

The second new feature is nested queries. Nested queries are used to match documents related to others, or contained in a specific archive, or sent as attachments to an email, etc. The pattern syntax for this is "related to ... ,", where "..." matches a list of terms, and the comma indicates that this list of matched terms ends when a comma or the end of the query is encountered. The parsing pass receives the list of the tokens matched by the "...", and can build a subquery with them.
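The behaviour of the "... ," part of such a pattern can be sketched as follows. The function name is hypothetical, and the sketch assumes that the tokenizer has already turned the comma into a token of its own.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Collects the tokens matched by "...": everything from `start` up to the
// next comma or the end of the query. `next` receives the index at which the
// outer query resumes.
std::vector<std::string> captureUntilComma(const std::vector<std::string> &tokens,
                                           std::size_t start, std::size_t &next) {
    assert(start <= tokens.size());
    std::vector<std::string> captured;
    std::size_t i = start;
    while (i < tokens.size() && tokens[i] != ",") {
        captured.push_back(tokens[i]);
        ++i;
    }
    next = (i < tokens.size()) ? i + 1 : i; // skip the comma itself
    return captured;
}
```

The captured tokens are what the parsing pass would then parse again as a query of its own, while the outer query continues at `next`.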

The query I used to test this feature is "files related to e-mails sent by Denis, size > 2M". This query is parsed to this tree of terms:

<and>
    <type uri="nfo#FileDataObject"/>
    <comparison property="nie#relatedTo"
                comparator=":"
                inverted="false">
        <and>
            <type uri="nmo#Email"/>
            <comparison property="nmo#messageFrom"
                        comparator=":"
                        inverted="false">
                <literal datatype="XMLSchema#string">
                    Denis
                </literal>
            </comparison>
        </and>
    </comparison>
    <comparison property="nfo#fileSize"
                comparator="&gt;"
                inverted="false">
        <literal datatype="XMLSchema#long">
            2097152
        </literal>
    </comparison>
</and>

You see that everything seems to work. "2M" is converted to an integer, and the subquery starts at line 6 (the and) and ends at line 15 (the end of the and).

The next steps are the following:

- Parsing tags (using patterns like "tagged as %1", "(has|have) tag %1" and "# %1", the hashtag syntax)
- Getting range information about properties. This will allow the parser to know that "fileSize" needs an integer (and not a string for instance). This feature will be especially useful for the auto-completion box, as it will allow the box to show drop-down lists for known property ranges. For instance, typing "has tag" will list your tags, "sent by" will display your contacts, and "modified on" could even show a calendar!
- Parsing date-times. I want to see if my parser can handle all the parsing rules of the current KHumanDateTimeParser I developed some weeks ago.

I hope to have most of that ready by the start of next week. This way, I will be on schedule and I will be able to start thinking about the syntax-highlighting and auto-completion.

More and simpler patterns

Published on Monday 17 June 2013 in Nepomuk

Yesterday, I continued my work on the Nepomuk query parser (currently out of tree, available here), and I enhanced several aspects of it. My aim was to have it clean and powerful enough to completely parse the example query I gave in this entry. Even though I'm not quite there yet, most of the query can already be parsed.

Regular expressions and unit splitting

The first two modifications of yesterday are ones I talked about in my previous blog entry.

Sometimes, matching exact words in rules is not enough. For instance, you may want to match words that can be in singular or plural form. In English, this is simple (just duplicate the pattern and write one in the singular form, the other in the plural form), but other languages have declensions or complex word forms.

The philosophy of the parser here is to match everything the user wants to be matched, even if it is a mistake. For instance, there is no need to reject queries having plural and singular forms interleaved, as they may be typos or the user just made a mistake.

The syntax of these regular expressions is simple: everything that is not a placeholder in a pattern is a regular expression, so "sent by %1" contains two very simple regexps, and "%1 (th|rd|nd) of %2" contains a more complex one.

Another modification of the parser was needed due to the fact that units can be part of an integer, or be separate from it. For instance, I can write "3GB" or "3 GB". I don't know if only one of these spellings is correct in English, but the parser has to handle both, as other languages could even swap the unit and the number, or put words in between them.

A special parser pass now separates units from numbers. The locale provides a list of suffixes/prefixes (they are matched at the two ends of a number). Currently, the English locale provides the standard file-size units (MiB, GiB, etc), the base-10 version of them (used by Windows, this is MB, GB, KB, etc), and ordinal suffixes (th, rd, nd). These units are matched case-insensitively. When a number is followed by or prefixed with one of these words, the word is split apart. "3GB" becomes "3 GB".
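A minimal sketch of such a splitting step, in plain C++. The names are hypothetical, only suffixes are handled (not prefixes), and the list of units is passed in by the caller where the real pass gets it from the locale.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Lower-cases ASCII text so that units can be matched case-insensitively.
std::string toLower(std::string s) {
    for (char &c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Splits "3GB" into "3 GB" when the suffix is a known unit; other words are
// returned unchanged.
std::string splitUnit(const std::string &word, const std::vector<std::string> &units) {
    assert(!word.empty());
    std::size_t digits = 0;
    while (digits < word.size() && std::isdigit(static_cast<unsigned char>(word[digits]))) {
        ++digits;
    }
    if (digits == 0 || digits == word.size()) return word; // no number, or nothing after it

    const std::string suffix = word.substr(digits);
    for (const std::string &unit : units) {
        if (toLower(suffix) == toLower(unit)) {
            return word.substr(0, digits) + " " + suffix;
        }
    }
    return word;
}
```

Once the unit is a word of its own, the later passes only ever have to deal with the "3 GB" form, whichever spelling the user typed.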

Comparison operators

When I started to think about this parser, I wanted each value (a name, a number, everything that is not a property name) to have a type hint. If I enter "3GB", I want to match a file size. If I enter "#something", I want to match documents that are tagged as something.

The problem is that the Nepomuk classes I use (the ones in Nepomuk2::Query, which are very well designed by the way) do not allow me to store arbitrary type hints into values. Sure, if the value is a constant, I know that it is a string, an integer or a date, but nothing more precise.

The solution I am currently experimenting with is to replace values with comparisons. Plain strings and integers remain literal values (constants), but things like "3GB" become comparisons. The property which is compared to the value is the type hint, and the operator is the equality. So, "3GB" is parsed to "nfo:fileSize=3GB".

Now, the default property or the operator may be wrong. The operator can be modified using the PassComparators pass. Patterns like "less than %1" are matched, and if "%1" is a comparison term, its operator is changed to the correct value. If "%1" is a literal value, a new comparison is created. Its value is the literal term, its operator is the one corresponding to the pattern, and its property is null.
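The two cases handled by the comparator pass can be written down as a short sketch. SimpleTerm and applyComparator are hypothetical and flatten the real term classes into three strings; an empty property plays the role of the null property.

```cpp
#include <cassert>
#include <string>

// Hypothetical, flattened term: a literal has an empty property and comparator.
struct SimpleTerm {
    std::string property;   // e.g. "nfo:fileSize"; empty means null / unknown
    std::string comparator; // e.g. "=", "<", ">"; empty for a plain literal
    std::string value;
    bool isComparison() const { return !comparator.empty(); }
};

// Applies a comparator pattern such as "less than %1" to the term matched by %1.
SimpleTerm applyComparator(const SimpleTerm &matched, const std::string &comparator) {
    assert(!comparator.empty());
    if (matched.isComparison()) {
        // Existing comparison: only the operator changes, the property is kept.
        SimpleTerm changed = matched;
        changed.comparator = comparator;
        return changed;
    }
    // Literal value: a new comparison is created, its property stays null.
    return SimpleTerm{"", comparator, matched.value};
}
```

A later pass (or the query itself) is then free to fill in the null property when the context makes it known.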

The parser is therefore able to parse "> 3KB" to nfo:fileSize>3000, and once dates are implemented, "mails sent before June 13" will produce the expected query.

Simpler pattern language

You may remember that my previous blog post presented a pattern language that was fairly complex (it was also complex to parse). Patterns were made of words, separated by spaces, that could be literals to be matched exactly or placeholders. A placeholder was like "term0", "string1", etc.

The first modification was to match literals not exactly but using regular expressions. The second one was to simplify the placeholders.

When I write a new parsing pass, I know exactly what kind of terms I want to match. As the developer, I know that the first term matched must be a string, and that the second one can be anything ("term1"). If one of these placeholders is changed, the parsing pass may cease to work correctly.

So, I thought that translators do not have to (and must not) change the placeholder types. They just need to reorder the placeholders if needed and to put words between them. They are only interested in their position, not their meaning. I therefore replaced the complex matching syntax with simple numbers prefixed with a percent sign, like the ones you use with QString::arg. Now, "sent by" followed by a typed placeholder becomes "sent by %1".

Matching properties

In order to experiment with the architecture of the parser, I wrote a small parser pass that recognizes "sent by X" and changes that to nmo:messageFrom=X. Yesterday, I wanted to add more properties, and I noticed that every parser pass would be the same, except that the property matched would be different. All the work is done by the pattern matcher.

So, I replaced the "sent by" pass with a more general "property" one. This avoids code duplication and makes it quick to add patterns for new properties. No need to write a pass, just add a line like this in parser.cpp:

d->pass_properties.setProperty(
    Nepomuk2::Vocabulary::NMO::messageSubject());
progress |= d->runPass(d->pass_properties,
    i18nc("Title of an e-mail", "(en)?titled? %1;title is %1;%1 as title"), 1);

The pattern in this example seems awkward, but allows the parser to match "mails having Foo as title", "documents entitled Foo", "file whose title is Foo", etc. This complex pattern shows all the features of the pattern matcher, and is normally not too difficult to translate. The semicolon separates independent rules of the same pattern. I changed the pipe separator to a semicolon because the pipe can now be used in regular expressions.
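To make the pattern syntax more concrete, here is a minimal sketch of how such a pattern could be matched against a list of words. It is written in standard C++ with std::regex and is not the parser's real matcher; `split`, `matchRule` and `matchPattern` are hypothetical names. Rules are separated by semicolons, "%N" binds any word to argument N, and every other token is a case-insensitive regular expression that must match a whole word.

```cpp
#include <map>
#include <optional>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

using Words = std::vector<std::string>;
using Bindings = std::map<int, std::string>;  // placeholder number -> matched word

// Splits text on a separator character, skipping empty parts.
static Words split(const std::string &text, char sep) {
    Words parts;
    std::istringstream in(text);
    std::string part;
    while (std::getline(in, part, sep)) {
        if (!part.empty()) {
            parts.push_back(part);
        }
    }
    return parts;
}

// Tries a single rule (e.g. "title is %1") on the words starting at `start`.
static std::optional<Bindings> matchRule(const std::string &rule,
                                         const Words &words, std::size_t start) {
    const Words tokens = split(rule, ' ');
    if (start + tokens.size() > words.size()) {
        return std::nullopt;
    }
    Bindings bound;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        const std::string &token = tokens[i];
        const std::string &word = words[start + i];
        if (token.size() > 1 && token[0] == '%') {
            bound[std::stoi(token.substr(1))] = word;  // placeholder: accept any word
        } else {
            const std::regex re(token, std::regex::ECMAScript | std::regex::icase);
            if (!std::regex_match(word, re)) {
                return std::nullopt;
            }
        }
    }
    return bound;
}

// Tries each semicolon-separated rule in order; the first match wins.
std::optional<Bindings> matchPattern(const std::string &pattern,
                                     const Words &words, std::size_t start) {
    for (const std::string &rule : split(pattern, ';')) {
        if (auto bound = matchRule(rule, words, start)) {
            return bound;
        }
    }
    return std::nullopt;
}
```

With the pattern from the example above, the words "entitled Foo" and "Foo as title" both bind %1 to "Foo". Because the first matching rule wins in this sketch, the order of the rules inside a pattern matters.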

All in all, the parser is advancing, and even if it is still an experiment, its architecture seems not too bad for what I want to do. In the coming days, I will implement nested queries (explicitly enclosed in parentheses by the user, or matched by patterns like "related to ... ," the comma being an indicator of the end of the subquery, as in "documents related to mails sent by Jeff, and having a size smaller than 2MB"), and the actual production of the Nepomuk query (Nepomuk2::Query::Query, a very powerful and well-designed class by the way).

Another thing I want is a means of getting metadata about properties. The most interesting piece is their range. For instance, if I enter "mails sent by X", I must know that X has to be a nco:PersonContact. One general solution would be to maintain a database of metadata about properties (I know that Nepomuk already does that, but it retrieves the properties by querying the Nepomuk server, which may be a bit too slow for use by the parser). Another one would simply be to hard-code the metadata into the PassProperties::setProperty method calls.

I also need to think about auto-completion and syntax highlighting. Syntax highlighting only requires that terms carry positional information (to be able to map parsed terms to positions in the input query), but auto-completion needs the pattern matcher to be able to partially match patterns (and to report what was expected at the location of the cursor), and will use the property ranges to show nice drop-down lists (if I enter "mails sent by", I will see the list of my contacts).

Finding Information in Human-Entered Search Queries

Published on Sat 15 June 2013 in Nepomuk, (Comments)

The coding period of this year's Google Summer of Code starts this Monday, and I have just finished my exams. It is a good time to start working on my project, a Nepomuk query parser.

When I learned that I was accepted as a GSoC student, I began defining a grammar that will be used to parse Nepomuk search queries. I sent several mails to the nepomuk mailing list, and received very good feedback.

Formality and User-Friendliness

My first mail described a very formal grammar, very much like the one used by the current parser, that uses a key:value syntax (for instance, reservation house hasTag:holidays). The first comments said that this was not user-friendly enough, and pointed me to alternative syntaxes. In fact, I had only a small idea of what would be user-friendly, so these examples greatly helped me.

I then proposed a small modification of my syntax. The first version was purely key:value based, and relied heavily on special characters, parentheses, colons, operators, etc. The second version made most of these special characters optional, and allowed keys to span multiple words. sent_by:Me could then be replaced by sent by Me, a far more user-friendly syntax.

Simplifying the Parser

The problem with my second syntax is that it is fairly complex. It is a full formal grammar, with nested queries, and even so the parser is unable to parse anything that is not a sequence of "key with spaces" value statements. In English, nearly everything can be expressed with these kinds of sequences, but other languages need to split the key around the value (if they have detached postfixes, for instance), and even "sent last week to Jimmy" cannot be parsed, although it is a simple rewording of "sent to Jimmy, sent last week".

After having discussed a bit with Vishesh Handa, my mentor, I began thinking about how a parser could handle these kinds of weird queries. Vishesh also told me that I do not need to keep the new syntax compatible with the old one, which simplifies my thinking.

Before being accepted as a GSoC student, I developed a small library that parses human-entered date-times. "first Monday of next month" will be parsed to the date of the first Monday of next month. This "parser" is not really one in the strict sense of the term. The date-time is not parsed from left to right (from start to finish); instead, information is cherry-picked. Rules say for instance that "first X", where X is a day, sets the day of the current period to 0. "next X" is also a rule, and here it is responsible for incrementing the month number.

This approach is not very formal, but allows rules to be written independently of the parser. The parser ceases to be a parser and becomes a scripting environment. In fact, the human date-time parser used per-locale XML files and relied on KDE's KCalendarSystem class.

For the Nepomuk query parser, I plan to use an approach like this, but a bit enhanced. The idea of rules is kept, but rules become more powerful and less painful to translate.

Architecture of the Parser

I will explain the general architecture of the parser using an example query made of words, English-specific constructs and key/value pairs:

e-mails sent by Jimmy, containing "hot dogs", and received before June 13

Don't be afraid of this query: a plain list of keywords is also recognized by the parser.

The first step is the splitting of the query into words. The split is operated at every character for which QChar::isSpace returns true. Normally, this covers every spacing character for every language handled by the Unicode character set.

When the words are split, the parser uses them to build a list of Nepomuk2::Query::LiteralTerms. This list can be used to search for documents by keywords. The next passes are only here in order to improve the search results, and also to change the simple literal values into something more precise, that can be used for auto-completion when needed.

(e-mails) (sent) (by) (Jimmy) (,) (containing) (hot dogs) (and) (received) (before) (June) (13)
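The splitting step can be sketched like this in standard C++. This is only an illustration: the real parser relies on QChar::isSpace, which is Unicode-aware, whereas `std::isspace` works on single bytes. The handling of quotes and commas is inferred from the term list above, and `splitQuery` is a hypothetical name.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Splits a query into words. Quoted text is kept as one word (without the
// quotes) and a comma becomes a word of its own.
std::vector<std::string> splitQuery(const std::string &query) {
    std::vector<std::string> words;
    std::string current;
    bool inQuotes = false;

    // Pushes the word being built, if any, and starts a new one.
    auto flush = [&]() {
        if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    };

    for (char c : query) {
        if (c == '"') {
            flush();
            inQuotes = !inQuotes;
        } else if (inQuotes) {
            current += c;  // spaces inside quotes belong to the word
        } else if (std::isspace(static_cast<unsigned char>(c))) {
            flush();
        } else if (c == ',') {
            flush();
            words.push_back(",");
        } else {
            current += c;
        }
    }
    flush();
    return words;
}
```

Each returned word would then be wrapped in a literal term before the passes run.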

The first pass transforms string constants into constants of a more suitable type. For instance, "13" is converted to 13 (the integer), "3.14" becomes a double, "2MB" becomes 2,000,000, and finally "2MiB" is converted to 2,097,152.

(e-mails) (sent) (by) (Jimmy) (,) (containing) (hot dogs) (and) (received) (before) (June) (int:13)
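The integer part of this conversion might look like the following sketch in standard C++. `toInteger` is a hypothetical helper, the unit spellings are matched exactly for simplicity, and doubles such as "3.14" are not handled here.

```cpp
#include <cctype>
#include <cstdint>
#include <optional>
#include <string>

// Converts "13", "2MB" or "2MiB" to an integer. Returns nothing for words
// that do not start with digits or that end with an unknown unit.
std::optional<std::int64_t> toInteger(const std::string &word) {
    std::size_t i = 0;
    std::int64_t number = 0;
    while (i < word.size() && std::isdigit(static_cast<unsigned char>(word[i]))) {
        number = number * 10 + (word[i] - '0');
        ++i;
    }
    if (i == 0) {
        return std::nullopt;  // not a number at all
    }

    const std::string unit = word.substr(i);
    if (unit.empty()) return number;
    // Base-10 units
    if (unit == "KB") return number * 1000;
    if (unit == "MB") return number * 1000 * 1000;
    if (unit == "GB") return number * 1000 * 1000 * 1000;
    // Base-2 units
    if (unit == "KiB") return number * 1024;
    if (unit == "MiB") return number * 1024 * 1024;
    if (unit == "GiB") return number * 1024 * 1024 * 1024;
    return std::nullopt;  // unknown suffix: keep the word as a string
}
```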

Another pass that I have already implemented is what I call type hints. If you have a generic search interface (like KRunner or the equivalent in Plasma Active), the queries entered in them can concern any data indexed on the computer, like files, e-mails, photos, etc. Even if we only think about Dolphin, Nepomuk knows that a file is not only a file, but a document, a picture, a movie, etc. This second pass recognizes type hints in the query and transforms them into filters on the document type.

(ResourceType=nmo:Email) (sent) (by) (Jimmy) (,) (containing) (hot dogs) (and) (received) (before) (June) (int:13)

The third pass already implemented matches "sent by X" and "from X" and adds a filter on the sender of an e-mail.

(ResourceType=nmo:Email) (nmo:messageFrom=Jimmy) (,) (containing) (hot dogs) (and) (received) (before) (June) (int:13)

Currently, my small experiment does not get the list of known contacts from Nepomuk, but instead tries to directly compare the nmo:messageFrom property with a string. It is not correct and will not work, but I first wanted to have a working parser before generating real queries.

The next passes are not yet implemented but will be in the coming days or weeks. One of them could match "containing X" and change the query like this (bif:contains is what the current parser seems to use to match documents containing something, but there may be a better way of doing this):

(ResourceType=nmo:Email) (nmo:messageFrom=Jimmy) (,) (bif:contains="hot dogs") (and) (received) (before) (June) (int:13)

The next one matches a day of a month. In English, the month name comes before the day number, but other languages do things differently. This kind of pattern matching can be done using a small pattern language developed for the parser. Each word in the pattern is either an immediate value (it matches a literal value having exactly this value) or a constraint on the term to be matched. For instance, June <integer0> matches "June 13". The 0 after integer allows the translators to move placeholders around in the pattern, everything will be kept ordered on the C++ side.

So, a day-of-month rule could be <string0> <integer1> in English. The C++ handler for this pattern will first check that string0 is a month name and that integer1 lies between 1 and 31. Then, the corresponding filter can be output. This rule can be translated in French to <integer1> <string0> as month names and day numbers are inverted in French (we say "13 Juin"). In fact, even in English the phrase can be "13th of June". No problem, the parser accepts multiple patterns for a rule:

d->runPass(d->pass_dayofmonth, i18nc(
    "Day number of a month",
    "<string0> <integer1>|<integer1> (?:th|rd|nd) of <string0>|<integer1> <string0>"
), 2); // 2 = there are two arguments in the pattern

Translators are free to add as many rules as they want. Notice that regular expressions will be allowed (currently, they are not yet implemented). Another modification to the current parser that has to be done is to split "3rd" into "3" and "rd". This could also be used to split "3GB" into "3" and "GB", thus simplifying the handling of file sizes.
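The check performed by the C++ handler of the day-of-month rule could be sketched as follows, in standard C++. `DayOfMonth` and `handleDayOfMonth` are hypothetical names, and the month table is hard-coded in English here, whereas the real parser would take it from the locale.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <optional>
#include <string>

struct DayOfMonth {
    int month;  // 1 = January
    int day;
};

// Receives the two placeholders in their C++ order (string0, integer1),
// whatever order the translated pattern used. Returns nothing if string0 is
// not a month name or if integer1 is outside 1..31.
std::optional<DayOfMonth> handleDayOfMonth(std::string monthName, int day) {
    static const std::array<std::string, 12> kMonths = {
        "january", "february", "march", "april", "may", "june",
        "july", "august", "september", "october", "november", "december"
    };
    std::transform(monthName.begin(), monthName.end(), monthName.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const auto it = std::find(kMonths.begin(), kMonths.end(), monthName);
    if (it == kMonths.end() || day < 1 || day > 31) {
        return std::nullopt;
    }
    return DayOfMonth{static_cast<int>(it - kMonths.begin()) + 1, day};
}
```

Called with ("June", 13), this sketch returns month 6 and day 13; the year would still have to come from elsewhere, for instance the current date.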

With that said, the query is now:

(ResourceType=nmo:Email) (nmo:messageFrom=Jimmy) (,) (bif:contains="hot dogs") (and) (received) (before) (date:2013-06-13)

The last thing that would be nice to have parsed is the "received before" part. Here, the parser needs to parse "before" followed by a date to produce a new date, but with a hint that the comparison to be done with this date is not equality but an ordering relation. Then, the parser has to match "received" followed by that date and build the actual comparison:

(ResourceType=nmo:Email) (nmo:messageFrom=Jimmy) (,) (bif:contains="hot dogs") (and) (nmo:receivedDate<=2013-06-13)

Now, there is nothing more to do. The parser stops trying rules (it does so in a loop, like the optimization passes of a compiler, each rule being able to alter the query and to expose further refinements), and proceeds to the final step: literal values smaller than a minimum size are removed. This removes the comma and the "and". We get the final query:

(ResourceType=nmo:Email) AND (nmo:messageFrom=contact:Jimmy) AND (bif:contains=string:"hot dogs") AND (nmo:receivedDate<=date:2013-06-13)

If there was a literal value bigger than the threshold and not matched to anything, it would have been matched against its default property. If the literal value is a string, its default property is bif:contains. For a date, it is a set of properties like the date of creation/receiving, modification, etc.
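The loop over passes and the final clean-up of short literals can be sketched as follows in standard C++. `Term`, `Pass`, `runToFixedPoint` and `dropShortLiterals` are hypothetical names chosen for the illustration, and the minimum length is left as a parameter because the post does not give the threshold.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Simplified term (hypothetical): either a plain literal or an already-built filter.
struct Term {
    std::string text;
    bool isLiteral;
};

using Query = std::vector<Term>;
// A pass returns true when it changed the query.
using Pass = std::function<bool(Query &)>;

// Runs every pass again and again until a full round changes nothing,
// in the spirit of the optimization passes of a compiler.
void runToFixedPoint(Query &query, const std::vector<Pass> &passes) {
    bool progress = true;
    while (progress) {
        progress = false;
        for (const Pass &pass : passes) {
            progress |= pass(query);
        }
    }
}

// Final step: removes the literals shorter than minLength (",", "and", ...).
// Terms that were turned into filters are always kept.
void dropShortLiterals(Query &query, std::size_t minLength) {
    query.erase(std::remove_if(query.begin(), query.end(),
                               [minLength](const Term &t) {
                                   return t.isLiteral && t.text.size() < minLength;
                               }),
                query.end());
}
```

A pass that keeps reporting progress without converging would make this loop run forever, so each pass has to return true only when it really rewrote something.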

Current state and future plans

Source code: nepomukqueryparser on GitHub, LGPLv2.1+

Currently, the parser is completely experimental and cannot be used to do anything even remotely useful. I blog about it in order to get comments on the current implementation direction, and to present how I see the parser. I like to communicate regularly on my blog: during my GSoC two years ago, I averaged one blog post every day or two. Feel free to comment and to suggest examples of difficult-to-parse queries; I will try to structure the parser so that it can parse your query (if it is user-friendly enough).

Another point is that my GSoC project is also about implementing an auto-completed input field. I currently have thought of two strategies that can be used to present auto-completion propositions to the user:

Patterns are matched from left to right. If the start of a pattern matches (say, more than 50%, or anything containing an immediate value like a keyword), the end of the pattern can be presented to the user. For instance, if I enter "sent by", the parser knows that the last part of the pattern is a contact, and the auto-complete box can show the list of my contacts.
Terms are typed. When the user edits the query, his or her cursor is not at the end of the input field, and the parser must be more careful (the user may be breaking already-parsed statements or inserting things between them). The already-parsed terms can be used to know what the user is editing. If the user places the cursor on something that was found to be a date-time, a calendar can be shown.

The current implementation is able to parse things like "mails sent by Jimmy", and gives the following result:

<?xml version="1.0"?>
<type uri="http://www.semanticdesktop.org/ontologies/2007/03/22/nmo#Email"/>

<?xml version="1.0"?>
<comparison property="http://www.semanticdesktop.org/ontologies/2007/03/22/nmo#messageFrom" comparator="=" inverted="false">
    <literal datatype="http://www.w3.org/2001/XMLSchema#string">Jimmy</literal>
</comparison>

I will post more details in the coming days, as I advance in the parser and solve different problems.

Writing a new Nepomuk Query Parser during the Summer

Published on Tue 28 May 2013 in Nepomuk, (Comments)

Hello everyone!

I'm Denis Steckelmacher, a Belgian student at the end of my second year of Computer Science at the Free University of Brussels. Yesterday, I had the pleasure to learn that my Google Summer of Code proposal, "A new Query Parser and Auto-Completed Input Field for Nepomuk", was accepted by KDE.

Project summary

Nepomuk is the software that indexes files, mails, contacts, and everything you ask it to index on your computer. All these resources are stored somewhere for future use, and are linked to each other. The main goal of Nepomuk is not only to store resources, but also to find them. When I type something in KRunner and files containing this or these words appear in the list, it is Nepomuk in action. When I'm looking for a mail, I think it is also Nepomuk that is used (in recent versions of KMail).

The goal of my project is to make this search experience easier and more enjoyable for us humans. Currently, "queries" entered in a search field are parsed by a good-enough parser that can understand words, tags, and some other things. My project is about rewriting this parser to be more extensible, and I will strive to make it as human-friendly as possible.

For instance, it will be my first parser with an ambiguous grammar, because humans are ambiguous. If I type "two days ago" in a search field, I may want to find documents containing "two days ago" (I sent a mail to my sister and I only remember that this sentence was in it), or any of these three words, or any document created or edited two days ago. Even filtering documents by date is ambiguous, as I may want to find documents that were touched exactly two days ago, or I may also want to find the ones touched yesterday and this morning.

Add to this the possibility to filter by properties, for instance "size<=4M" or "hasTag:holidays", and you can get something very complex. Think about what Google allows you to write in its search bar ("my word site:mysite.com" for instance).

Action plan

The first thing I need to do is to discuss with the Nepomuk developers and people in charge of the localization. We need to define a precise yet human-friendly (and more importantly, implementable) grammar that will be understood by the parser. I hope to be able to put most of the features I want in the parser, but it may be very difficult to make it understand correctly what users without any technological knowledge will want to search for.

The timeline for this project is described near the end of my proposal. To summarize, the aforementioned discussion will take place from now to the middle of June (and will continue if needed, but I hope to already have some feedback by the middle of June). Then, I will implement the parsing infrastructure. I need to design an Abstract Syntax Tree (a computer-friendly and unambiguous way of representing information) and the beginning of the parser.

During July, I will begin the second half of my proposal: a line edit that provides syntax highlighting and auto-completion. This way, users will be helped and will see how their query is understood by the system. The auto-completion will show them what is possible (I like to empower the users). For instance, after having typed "date:", the user will see "enter a date", which will open a calendar if clicked on, and "describe a date" with "last week" shown as an example.

The syntax highlighting will also be a great debugging tool, as I will be able to see how the parser understands or misunderstands what I enter. Some features will be a bit tricky, for instance the optionality of quotes and nearly everything (I want "school date:two days ago" to be parsable, and my dream would be that "school, two days ago" or "school date:two days ago size < 4M" are also understood).

Any progress will be described on my blog, which I hope will be registered on Planet KDE. Comments are welcome on any blog entry, and if they don't work (I use Disqus, which I have never used before), don't hesitate to send me a mail (my mail account at yahoo.fr is steckdenis).

Current status

As my proposal was just accepted, I have not yet had time to do very fancy things. The only code I developed before my proposal was accepted, because the feasibility of my proposal depends on it, is a human date-time parser.

This class is responsible for translating "Monday of last week" into 2013-05-20 (if the current date is 2013-05-28). It does that by matching locale-specific rules on the input string. Each rule is accompanied by actions to be performed on the date-time, which is initialized to the current date and time when parsing starts. The order of the rules is important, as they have to be matched in the right order to be meaningful. For instance, "last week" will first be matched and will remove 7 days from the current date-time. Then, "Monday" is matched and will set the current day-of-week to Monday.
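The two actions of this example can be reproduced with the standard C library, as in the sketch below. It is not the code of the date-time parser, which relies on KCalendarSystem; `makeDate`, `applyLastWeek` and `applyWeekday` are hypothetical names, and weeks are assumed to start on Monday.

```cpp
#include <ctime>

// Re-normalizes the fields after arithmetic (a day of -3 rolls back into the
// previous month, for instance) and recomputes tm_wday.
static void normalize(std::tm &date) {
    date.tm_isdst = -1;
    std::mktime(&date);
}

std::tm makeDate(int year, int month, int day) {
    std::tm date = {};
    date.tm_year = year - 1900;
    date.tm_mon = month - 1;
    date.tm_mday = day;
    date.tm_hour = 12;  // noon keeps the date stable across DST changes
    normalize(date);
    return date;
}

// Action of the rule "last week": remove seven days.
void applyLastWeek(std::tm &date) {
    date.tm_mday -= 7;
    normalize(date);
}

// Action of a weekday name: move to that day inside the current week.
// weekday: 1 = Monday ... 7 = Sunday.
void applyWeekday(std::tm &date, int weekday) {
    const int current = (date.tm_wday == 0) ? 7 : date.tm_wday;
    date.tm_mday += weekday - current;
    normalize(date);
}
```

Starting from `makeDate(2013, 5, 28)`, a Tuesday, `applyLastWeek` gives 2013-05-21 and `applyWeekday(date, 1)` then gives 2013-05-20, the result quoted above. Applying the two actions in the other order would give the same date here, but rules that set a field rather than shift it generally depend on the order.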

This code is as locale-independent as possible. There is no string matching in the code, no dependence on a specific character set, and everything (month names, numbers, etc.) is read from the parsing rules. The advantage is that languages I really don't know will still be able to use this parser to parse dates, even if they are in another calendar system. The drawback is that parsing rules are complex, currently written in a big XML file with a special syntax for placeholders (example below) and using regular expressions. The KDE translators seemed OK with that, but I don't want the parsing rules to be so complex that it takes months to have a nice set of parsing rules for every language.

<!-- Periods in this language (simple translations with metadata if needed) -->
<period type="year" value="10">decades?</period>
<period type="year">years?</period>
<period type="month" value="3">seasons?</period>
<period type="month">months?</period>

<values name="month">
    <value value="1">january</value>
    <value value="2">february</value>
    <!-- ... -->
</values>
<values name="number">
    <value value="1">one</value>
    <value value="1">a</value>
    <value value="1">first</value>
    <value value="2">two</value>
    <!-- ... -->
</values>

<!-- Parsing rules, here the documentation becomes necessary -->
<rule pattern="$number$ %period% ago">
    <sub value="$1" />  <!-- Remove $1 (the number) from the given period -->
</rule>
<rule pattern="in $number$ %period%">
    <add value="$1" />  <!-- Add the number to the given period -->
</rule>
<rule pattern="next %period%">
    <add value="1"/>    <!-- Add 1 to the given period -->
</rule>

The complete file is en_US.xml in the GitHub repository of this project.
