Parsing Nested Queries - Denis Steckelmacher

Published on Wed 19 June 2013 in Nepomuk

At the beginning of this week, I added two new features to my Nepomuk query parser.

The first one is logical operators. By default, terms or comparisons that are simply separated by spaces are ANDed. For instance, "holidays size > 2M" matches documents containing "holidays" and having a size greater than 2 megabytes. The supported operators are AND, OR and NOT (which can also be written "!"). Operators and their operands can be enclosed in braces to impose a specific grouping. All operators have the same priority and are left-associative: "a AND b OR c" is parsed as "(a AND b) OR c".
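To make the "same priority, left-associative" rule concrete, here is a minimal Python sketch. It is not the parser's actual code: the `parse` function and the tuple-based tree are hypothetical, terms are plain words, and grouping with braces is left out on purpose.

```python
def parse(tokens):
    """Fold terms left to right; every operator has the same priority."""
    tree = None
    pending_op = "AND"   # implicit operator between space-separated terms
    negate = False
    for token in tokens:
        if token in ("AND", "OR"):
            pending_op = token
        elif token in ("NOT", "!"):
            negate = not negate
        else:
            term = ("NOT", token) if negate else token
            # The tree built so far always becomes the left operand,
            # which is what makes the operators left-associative.
            tree = term if tree is None else (pending_op, tree, term)
            pending_op = "AND"
            negate = False
    return tree


print(parse("a AND b OR c".split()))
```

With this folding strategy, "a AND b OR c" yields `('OR', ('AND', 'a', 'b'), 'c')`, which matches the grouping "(a AND b) OR c".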

The second new feature is nested queries. Nested queries are used to match documents related to others, or contained in a specific archive, or sent as attachments to an email, etc. The pattern syntax for this is "related to ... ,", where "..." matches a list of terms and the trailing comma indicates that the list ends when a comma or the end of the query is encountered. The parsing pass receives the list of tokens matched by "..." and can build a subquery from them.
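The way "..." captures tokens up to a comma or the end of the query can be sketched as follows. This is only an illustration in Python: `tokenize` and `match_nested` are hypothetical helpers, and the real parser works on its own pattern and token structures.

```python
def tokenize(query):
    """Split on whitespace, treating each comma as a separate token."""
    return query.replace(",", " , ").split()


def match_nested(tokens, prefix=("related", "to")):
    """Return (captured, remaining) for the first match, or None."""
    n = len(prefix)
    for start in range(len(tokens) - n + 1):
        if tuple(tokens[start:start + n]) == prefix:
            end = start + n
            # "..." stops at the first comma or at the end of the query.
            while end < len(tokens) and tokens[end] != ",":
                end += 1
            captured = tokens[start + n:end]
            remaining = tokens[:start] + tokens[end + 1:]
            return captured, remaining
    return None


tokens = tokenize("files related to e-mails sent by Denis, size > 2M")
print(match_nested(tokens))
```

The captured tokens ("e-mails sent by Denis") would then be parsed on their own to build the subquery, while the remaining tokens ("files", "size > 2M") stay in the outer query.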

The query I used to test this feature is "files related to e-mails sent by Denis, size > 2M". This query is parsed into the following tree of terms:

<and>
    <type uri="nfo#FileDataObject"/>
    <comparison property="nie#relatedTo"
                comparator=":"
                inverted="false">
        <and>
            <type uri="nmo#Email"/>
            <comparison property="nmo#messageFrom"
                        comparator=":"
                        inverted="false">
                <literal datatype="XMLSchema#string">
                    Denis
                </literal>
            </comparison>
        </and>
    </comparison>
    <comparison property="nfo#fileSize"
                comparator="&gt;"
                inverted="false">
        <literal datatype="XMLSchema#long">
            2097152
        </literal>
    </comparison>
</and>

As you can see, everything seems to work. "2M" is converted to an integer (2097152), and the subquery starts at line 6 (the opening and) and ends at line 15 (the closing and).
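The conversion of "2M" to 2097152 corresponds to a binary multiplier (2 × 1024 × 1024). A minimal Python sketch of such a conversion is shown below; `parse_size` is a hypothetical name, and the "K" and "G" suffixes are assumptions, since only "M" appears in the example above.

```python
import re

# Binary multipliers; only "M" is confirmed by the example above.
SUFFIXES = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size literal such as "2M" to a number of bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"not a size literal: {text!r}")
    number, suffix = match.groups()
    return int(number) * SUFFIXES[suffix]


print(parse_size("2M"))  # 2097152
```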

The next steps are the following:

Parsing tags (using patterns like "tagged as %1", "(has|have) tag %1" and "# %1", the hashtag syntax)
Getting range information about properties. This will allow the parser to know that "fileSize" needs an integer (and not a string, for instance). This feature will be especially useful for the auto-completion box, as it will allow the box to show drop-down lists for known property ranges. For instance, typing "has tag" will list your tags, "sent by" will display your contacts, and "modified on" could even show a calendar!
Parsing date-times. I want to see if my parser can handle all the parsing rules of the current KHumanDateTimeParser that I developed a few weeks ago.
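
As an illustration of the tag patterns mentioned in the first step, the sketch below rewrites them as regular expressions in Python. This is a stand-in technique, not how the parser matches patterns: `TAG_PATTERNS` and `extract_tags` are hypothetical, and treating "%1" as a single whitespace-free word is an assumption.

```python
import re

# Regex equivalents of "tagged as %1", "(has|have) tag %1" and "# %1".
# Assumption: "%1" stands for one word without whitespace.
TAG_PATTERNS = [
    re.compile(r"tagged as (\S+)"),
    re.compile(r"(?:has|have) tag (\S+)"),
    re.compile(r"#\s*(\S+)"),
]


def extract_tags(query):
    """Collect every tag name matched by one of the patterns."""
    tags = []
    for pattern in TAG_PATTERNS:
        tags.extend(pattern.findall(query))
    return tags


print(extract_tags("photos tagged as holidays"))
print(extract_tags("documents that have tag urgent"))
print(extract_tags("files #work"))
```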

I hope to have most of that ready by the start of next week. This way, I will be on schedule and I will be able to start thinking about the syntax-highlighting and auto-completion.