More and simpler patterns

Published on lun 17 juin 2013 in Nepomuk, (Comments)

Yesterday, I continued my work on the Nepomk Query parser (currently out of tree, available here), and I have enhanced several aspects of it. My aim was to have it clean and powerful enough to be able to parse completely the example query I gave in this entry. Even thought I'm not quite there yet, most of the query can already be parsed.

Regular expressions and unit splitting

The first two modifications of yesterday are ones I talked about in my previous blog entry.

Sometimes, matching exact words in rules is not enough. For instance, you may want to match words that can be in singular or plural form. In English, this is simple (just duplicate the pattern and write one in the singular form, the other in the plural form), but other languages have declinations or complex word forms.

The philosophy of the parser here is to match everything the user wants to be matched, even if it is a mistake. For instance, there is no need to reject queries having plural and singular forms interleaved, as they may be typos or the user just made a mistake.

The syntax of these regular expressions is simple : everything that is not a placeholder in a pattern is a regular expression, so "sent by %1" contains two very simple regexps, and "%1 (th|rd|nd) of %2" contains a more complex one.

Another modification of the parser was needed due to the fact that units can be part of an integer, or be separate from it. For instance, I can write "3GB" or "3 GB". I don't know if only one of these spelling is correct in English, but the parser has to handle the two as other language could even swap the unit and the number, or put words in-between them.

A special parser pass now separates units from numbers. The locale provides a list of suffixes/prefixes (they are matched at the two ends of a number). Currently, the English locale provides the standard file-size units (MiB, GiB, etc), the base-10 version of them (used by Windows, this is MB, GB, KB, etc), and ordinal suffixes (th, rd, nd). These units are matched case-insensitively. When a number is followed by or prefixed with one of these words, the word is split apart. "3GB" becomes "3 GB".

Comparison operators

When I started to think about this parser, I wanted each value (a name, a number, everything that is not a property name) to have a type hint. If I enter "3GB", I want to match a file size. If I enter "#something", I want to match documents that are tagged as something.

The problem is that the Nepomuk classes I use (the ones in Nepomuk2::Query, that are very well designed by the way) do not allow me to store arbitrary type hints into value. Sure, if the value is a constant, I know that it is a string, an integer or a date, but nothing more precise.

The solution I am currently experimenting with is to replace values with comparisons. Plain strings and integers remain literal values (constants), but things like "3GB" become comparisons. The property which is compared to the value is the type hint, and the operator is the equality. So, "3GB" is parsed to "nfo:fileSize=3GB".

Now, the default property or the operator may be wrong. The operator can be modified using the PassComparators pass. Patterns like "less than %1" are matched, and if "%1" is a comparison term, its operator is changed to the correct value. If "%1" is a literal value, a new comparison is created. Its value is the literal term, its operator is the one corresponding to the pattern, and its property is null.

The parser is therefore able to parse "> 3KB" to nfo:fileSize>3000, and when dates will be implemented, "mails sent before June 13" will produce the expected query.

Simpler pattern language

You may remember that my previous blog post presented a pattern language that was fairly complex (it was also complex to parse). Patterns were made of words, separated by spaces, that could be literals to be matched exactly or placeholders. A placeholder was like "term0", "string1", etc.

The first modification was to match literals not exactly but using regular expressions. The second one was to simplify the placeholders.

When I write new parsing passes, I know exactly what kind of terms I want to match. Me, as a developer, I know that the first term matched must be a string, and the second one can be anything ("term1"). If one of these placeholder is changed, the parsing pass may cease to work correctly.

So, I thought that translators do not have to (and must not) change the placeholder type. They just need to reorder them if needed and to put words between them. They are only interested in their position, not their meaning. I therefore replaced the complex matching syntax with simple numbers prefixed with a percent, like the ones you use with QString::arg. Now, "sent by " becomes "sent by %1".

Matching properties

In order to experiment the architecture of the parser, I wrote a small parser pass that recognizes "sent by X" and changes that to nmo:messageFrom=X. Yesterday, I wanted to add more properties, and I noticed that every parser pass would be the same, except that the property matched would be different. All the work is done by the pattern matcher.

So, I replaced the "sent by" pass with a more general "property" one. This avoids code duplication and allows to quickly add patterns for new properties. No need to write a pass, just add a line like this in parser.cpp:

1
2
3
4
d->pass_properties.setProperty(
    Nepomuk2::Vocabulary::NMO::messageSubject());
progress |= d->runPass(d->pass_properties,
    i18nc("Title of an e-mail", "(en)?titled? %1;title is %1;%1 as title"), 1);

The pattern in this example seems awkward, but allows the parser to match "mails having Foo as title", "documents entitled Foo", "file whose title is Foo", etc. This complex pattern shows all the features of the pattern matcher, and is normally not too difficult to translate. The semicolon separates independent rules of the same pattern. I changed the pipe separator to a semicolon because the pipe can now be used in regular expressions.

All in all, the parser is advancing, and even if it is still an experiment, it seems that its architecture is not too bad for what I want to do. In the coming days, I will implement nested queries (explicitly enclosed in parenthesis by the user, or matched by patterns like "related to ... ," the comma being an indicator of the end of the subquery, like in "documents related to mails sent by Jeff, and having a size smaller than 2MB"), and the actual production of the Nepomuk query (Nepomuk2::Query::Query, a very powerful and well-designed class by the way).

Another thing I want to do is to have a mean to get metadata about properties. The most interesting one is their range. For instance, if I enter "mails sent by X", I must know that X has to be a nco:PersonContact. One general solution would be to maintain a database of metadata about properties (I know that Nepomuk already does that, but uses queries to the Nepomuk server in order to retrieve the properties, this may be a bit too slow for being used by the parser). Another one would simply to hard-code the metadata into the PassProperties::setProperty method calls.

I also need to think about the auto-completion and syntax-highlighting. The syntax-highlighting only requires that terms have a positional information (to be able to match parsed terms to positions into the input query), but the auto-completion needs the pattern matcher to be able to partially match patterns (and to report what was expected at the location of the cursor), and will use the property ranges to show nice drop-down lists (if I enter "mails sent by", I will see the list of my contacts).

« Finding Information in Human-Entered Search Queries   Parsing Nested Queries »