Parsing Human Date-Times

Published on Sat 22 June 2013 in Nepomuk

Over the last few days, there has not been much activity on my blog or in the GitHub repository that hosts my work for the summer. That does not mean that I did nothing: I was busy implementing date-times and wanted to be sure the direction was right before pushing commits.

Human Date-Times

What I call a "human date-time" is a text describing a date-time, written in human language. This has nothing to do with nice ISO-formatted dates or even the dd/mm/YY dates that are written on a calendar. These are expressions like "yesterday", "last month" or "the first Monday of 2012", that is to say dates that humans use when they are talking to somebody else.

I have already tried to parse these dates, and some code can be found in this repository. That code is based on XML files that contain per-locale parsing rules. For instance, "next $period$" makes it possible to recognize and parse "next month", "next year", "next week", etc.

Having the parsing rules in an XML file made this parser difficult to translate. The translators have to copy/paste the English rules and then translate every piece of text they see. They also need to figure out how to add rules specific to their language (or calendar system), and how to remove untranslatable ones without breaking too many things.

New Parsing Passes

For the Nepomuk query parser, I chose another approach. Instead of having per-locale rules, the parser has a set of C++ passes that each handle a part of a date-time. These passes are:

  • A pass that sets a value to a period, where the period and possibly also the value are parsed from the query. For instance, "next month" sets +1 to "month", and "3 days ago" sets -3 to "days". Absolute values are also supported, like in "first week".
  • A pass that sets values to periods, where periods are fixed and values are parsed from the query. This is "May 8" (the 8 is always a day, May is a month name), "2013-04-04" (formal date-times are also parsed), and times.
  • A last pass that transforms month and day names into values. "Monday" becomes 1, "Tuesday" becomes 2, etc. The pass relies on KCalendarSystem, which starts the week on Monday.
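
To make the last pass more concrete, here is a minimal sketch of a name-to-value lookup. It is not the real pass, which relies on KCalendarSystem and on localized names: the tables are hard-coded in English, and `indexOfName`, `dayOfWeekValue` and `monthValue` are hypothetical helpers invented for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical English-only tables; the real pass gets localized names
// from KCalendarSystem instead.
constexpr std::array<std::string_view, 7> kDayNames = {
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday"
};
constexpr std::array<std::string_view, 12> kMonthNames = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December"
};

// Returns the 1-based position of `name` in `names`, or nothing if absent.
template <std::size_t N>
std::optional<int> indexOfName(const std::array<std::string_view, N> &names,
                               std::string_view name)
{
    for (std::size_t i = 0; i < N; ++i) {
        if (names[i] == name) {
            return static_cast<int>(i) + 1;
        }
    }
    return std::nullopt;
}

// Monday is 1 and Sunday is 7, as the week starts on Monday.
std::optional<int> dayOfWeekValue(std::string_view name)
{
    return indexOfName(kDayNames, name);
}

// January is 1 and December is 12.
std::optional<int> monthValue(std::string_view name)
{
    return indexOfName(kMonthNames, name);
}
```

With these tables, "Monday" maps to 1 and "June" to 6, while a word that is not in the table yields no value and can be left alone by the pass.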

These passes are fairly generic. For instance, the second one, that sets values to fixed periods, always recognizes placeholders numbered from %1 to %7, where %1 is a year and %7 is a second. One big rule matches anything that sets a period, from "6th of June" to "6:30 pm".

d->runPass(d->pass_datevalues, i18nc(
    "A year (%1), month (%2), day (%3), day of week (%4), hour (%5), "
        "minute (%6), second (%7)",
    "%3 of %2 %1;%3 (st|nd|rd|th) %2 %1;%3 (st|nd|rd|th) of %2 %1;"
    "%3 of %2;%3 (st|nd|rd|th) %2;%3 (st|nd|rd|th) of %2;%2 %3 (st|nd|rd|th);"
    "%1 - %2 - %3;%1 - %2;%3 / %2 / %1;%3 / %2;"
    "in %2 %1; in %1;, %1;"
    "%5 : %6 : %7;%5 : %6;%5 h;"
));

This is a bit difficult to read, but this way, with a simple i18nc call, every language can provide its own set of rules.
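
In the string above, the individual rules appear to be separated by semicolons. Assuming that is the case, splitting the translated string into patterns could look like the sketch below. `splitRules` is a hypothetical helper and not the function used by the parser; it also keeps whitespace untouched, because I do not want to guess here whether a leading space is significant.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits a translated rule list on ';' and drops
// empty entries (for instance the one after a trailing ';').
// Whitespace inside a rule is kept exactly as written.
std::vector<std::string> splitRules(const std::string &rules)
{
    std::vector<std::string> patterns;
    std::string current;

    for (char c : rules) {
        if (c == ';') {
            if (!current.empty()) {
                patterns.push_back(current);
            }
            current.clear();
        } else {
            current += c;
        }
    }
    if (!current.empty()) {
        patterns.push_back(current);  // last rule without a trailing ';'
    }
    return patterns;
}
```

Each resulting pattern can then be matched on its own, which is what lets a translator add or remove rules just by editing one string.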

The pass that also sets the period cannot work with a single rule: it must be configured. For instance, "last %1", "first %1" and "next %1" all match exactly one period, but they have different meanings. The meaning is configured before the pattern is matched:

d->pass_dateperiods.setKind(PassDatePeriods::VariablePeriod, PassDatePeriods::Offset, 1);
d->runPass(d->pass_dateperiods,
    i18nc("Adding 1 to a period of time", "next %1"));

The kind of a period tells whether it is a day, a year, a month, etc. When it is VariablePeriod, the kind is parsed from the query (it is the %1 in the patterns). It is also possible to force the period to a specific kind: "yesterday" is a day, but nothing apart from the word itself says so:

d->pass_dateperiods.setKind(PassDatePeriods::Day, PassDatePeriods::Offset, -1);
d->runPass(d->pass_dateperiods,
    i18nc("One day ago", "yesterday"));

How Date-Times are Parsed

The parsing passes take terms that match a given pattern, and output replacement terms. For instance, the pass that recognizes file sizes replaces "3 KB" with "nfo:fileSize=3000".
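
As an illustration of this "match and replace" idea, here is a minimal sketch of the file size example. It is not the Nepomuk pass: `fileSizeTerm` is a hypothetical function, the output is a plain string rather than a real term, and the MB and GB multipliers are my assumption (only "3 KB" giving 3000 comes from the example above).

```cpp
#include <map>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical sketch: turns "<number> <unit>" into a replacement string,
// or returns nothing when the text does not look like a file size.
std::optional<std::string> fileSizeTerm(const std::string &text)
{
    // KB = 1000 matches the "3 KB" example; MB and GB are assumed to follow
    // the same decimal convention.
    static const std::map<std::string, long long> units = {
        {"KB", 1000LL}, {"MB", 1000000LL}, {"GB", 1000000000LL}
    };

    std::istringstream in(text);
    long long value = 0;
    std::string unit;
    if (!(in >> value >> unit)) {
        return std::nullopt;  // no leading number, or no unit after it
    }

    const auto it = units.find(unit);
    if (it == units.end()) {
        return std::nullopt;  // unknown unit
    }
    return "nfo:fileSize=" + std::to_string(value * it->second);
}
```

A text that does not match is simply left for the other passes, which is why the function returns an optional instead of failing.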

For a date-time, it is a bit more difficult. Each pass only sees a part of the date-time (it would be way too complex to try to parse every complete date-time possible using only one pass). For instance, a pass recognizes "first Monday" and knows that the final date must be the first Monday of the enclosing period. If I add "of June", another pass recognizes the month name and sets the period to "month". Finally, when all the parsing is done, the two fragments are fused, producing a valid date-time literal.

Each date-time piece of information is represented as a ComparisonTerm whose property is something like date://dayofweek/offset, where offset can also be value if the value is absolute ("3rd week of May" vs "next week").

A query like "first Monday of June" is therefore parsed to "date://dayofweek/value=1 AND date://month/value=6". A special pass is run between all the parsing passes that recognize date-time fragments and property passes that may need valid date-time literals. This pass is responsible for fusing fragments. It works by grouping adjacent fragments together in a DateTimeSpec, a big structure that contains information about years, months, weeks, days, days of week, hours, minutes and seconds.

A DateTimeSpec is complete when its group ends, either because the query is finished or because the user put something else after a date fragment. This way, "mails sent yesterday related to files created last month" correctly produces two date-times. A complete DateTimeSpec is then transformed into a valid date-time.
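
The grouping of adjacent fragments can be sketched as follows. This is a simplified model and not the parser's code: `Term` is a hypothetical stand-in for a ComparisonTerm, a fragment is detected only by its date:// prefix, and each group is kept as a list of terms instead of being folded into a DateTimeSpec.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed term.
struct Term {
    std::string property;  // e.g. "date://month/value", or anything else
    int value;
};

// A term is a date fragment when its property starts with "date://".
bool isDateFragment(const Term &term)
{
    return term.property.rfind("date://", 0) == 0;
}

// Collects runs of adjacent date fragments. Any other term closes the
// current group, so two dates separated by other words stay separate.
std::vector<std::vector<Term>> groupDateFragments(const std::vector<Term> &terms)
{
    std::vector<std::vector<Term>> groups;
    bool inGroup = false;

    for (const Term &term : terms) {
        if (!isDateFragment(term)) {
            inGroup = false;  // something else than a date: the group ends
            continue;
        }
        if (!inGroup) {
            groups.emplace_back();  // first fragment of a new group
            inGroup = true;
        }
        groups.back().push_back(term);
    }
    return groups;
}
```

With this model, the two fragments of "first Monday of June" end up in one group, while "yesterday" and "last month" in the longer query form two groups because other terms sit between them.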

This last step is very difficult, as the parser has to handle relative date-times (easy: start from now and add years, months or whatever) and absolute ones. The absolute date-times are never totally absolute, as "June 6" is not June 6 of year 1, but June 6 of this year. Absolute and relative fragments can even be mixed, as in "June 6 of next year" (see note 1 at the end of this post), which gives an absolute month and day, and a relative year.
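
To show how absolute and relative fragments can be combined, here is a deliberately small sketch restricted to years, months and days. It is not buildDateTimeLiteral: `Field`, `DateSpec`, `Date` and `buildDate` are hypothetical names, only month overflow is carried into the year, and days are not normalized (the real code works with QDateTime and the calendar system).

```cpp
#include <optional>

// One component of a date: an optional absolute value plus an offset.
struct Field {
    std::optional<int> value;  // absolute value, e.g. month = 6
    int offset = 0;            // relative offset, e.g. year + 1
};

// Hypothetical, much smaller cousin of the DateTimeSpec structure.
struct DateSpec {
    Field year, month, day;
};

struct Date {
    int year, month, day;
};

// A missing absolute value defaults to the current one, then the offset
// is applied on top of it.
int resolve(const Field &field, int current)
{
    return field.value.value_or(current) + field.offset;
}

Date buildDate(const DateSpec &spec, const Date &now)
{
    Date date{resolve(spec.year, now.year),
              resolve(spec.month, now.month),
              resolve(spec.day, now.day)};

    // Carry month overflow or underflow into the year (floor division).
    const int m0 = date.month - 1;  // 0-based month, possibly out of range
    date.year += (m0 >= 0) ? m0 / 12 : -((11 - m0) / 12);
    date.month = ((m0 % 12) + 12) % 12 + 1;

    // Days are intentionally left as-is: normalizing them needs real
    // calendar knowledge, which this sketch does not have.
    return date;
}
```

Starting from 22 June 2013, a spec with month 6, day 6 and a year offset of +1 resolves to 6 June 2014, and a month offset of -1 applied in January lands in December of the previous year.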

You can take a look at Parser::Private::buildDateTimeLiteral (in parser.cpp). This method builds a QDateTime object, and returns a literal term representing it. This term can now be used to build comparisons.

Date Intervals

Even though the parser is able to build date-times, it is not able to handle intervals. For instance, when I write "mails sent in 2011", I want a list of all the mails sent between January 1, 2011 and December 31, 2011 inclusive. The problem is that the parser is not yet able to produce this kind of interval. If I parse this query with the current parser, I get:

<and>
    <type uri="nmo#Email"/>
    <comparison property="nmo#sentDate"
                comparator="="
                inverted="false">
        <literal datatype="XMLSchema#dateTime">
            2011-01-01T14:32:03Z
        </literal>
    </comparison>
</and>

In fact, no mail was sent exactly at this date-time, so I will not get any result.

A possible solution is to replace date-time literals with comparisons (or, as there is no "Between" comparison operator, an AND term). I will try to do that this evening or tomorrow, as I want to start experimenting with syntax highlighting next week.
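
As a rough idea of what such an interval could look like for a whole year, the sketch below computes the two bounds that the AND term would compare against. Nothing here exists in the parser yet: `yearInterval` is a hypothetical helper, and treating the bounds as UTC strings with second precision is my simplification.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical helper: returns the first and last second of `year`,
// formatted as dateTime literals. "sent in 2011" would then become
// sentDate >= first AND sentDate <= second.
std::pair<std::string, std::string> yearInterval(int year)
{
    assert(year >= 0 && year <= 9999);  // keeps the 4-digit format valid

    char start[32];
    char end[32];
    std::snprintf(start, sizeof start, "%04d-01-01T00:00:00Z", year);
    std::snprintf(end, sizeof end, "%04d-12-31T23:59:59Z", year);
    return {std::string(start), std::string(end)};
}
```

For 2011, this gives 2011-01-01T00:00:00Z and 2011-12-31T23:59:59Z. Months, weeks and days would need the same treatment, with the length of the period taken from the calendar system.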


  1. In fact, the "of" in this query is currently not recognized by any pattern, and will remain in the parsed query. The two fragments will therefore be grouped into two different groups, thus producing two date-times. I still need to figure out how to avoid this. 
