Indeed, to apply an SRQL query, the information resource is analyzed to generate the corresponding user image, i.e., an image showing what the information actually looks like to the user. Then, an appropriate box model is built to encapsulate all the objects distinguishable in the visual appearance, possibly associating with each box a set of attributes deducible from the source information. Depending on the particular IE data domain, such attributes characterize specific properties of the objects, which may be both visual (e.g., the color of the box, the text contained in the box, etc.) and semantic (e.g., whether the box has a particular predefined meaning in the information).
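As an illustration of the box model described above, the following minimal Python sketch represents each box by its corner coordinates plus a dictionary of attributes. The class name `Box`, the field names, the attribute keys, and the convention that the origin is the top-left corner of the user image with y growing downward are all assumptions made for this example; none of them is taken from the SRQL tool itself.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Box:
    # Upper-left (UL) and lower-right (LR) corners in user-image pixels.
    # Assumption: origin at the top-left corner, y grows downward.
    x_ul: float
    y_ul: float
    x_lr: float
    y_lr: float
    # Visual and semantic attributes; the keys used below are hypothetical.
    attrs: Dict[str, Any] = field(default_factory=dict)


def width(box: Box) -> float:
    """Horizontal extent of a box."""
    return box.x_lr - box.x_ul


# A toy box model with one text box and one image box.
title = Box(10, 10, 300, 40, {"text": "Annual report", "color": "black"})
logo = Box(320, 10, 380, 70, {"kind": "image"})
box_model: List[Box] = [title, logo]
```

Keeping the attributes in a free-form dictionary reflects the fact that the available properties depend on the data domain.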
To give an idea, it is possible to extract the tables to the right of an image, the text below the first/last image, the links between two text blocks within a web page, the text of a certain color within a paragraph beginning with a given word within a PDF file, etc.
The basic structure of an SRQL query is shown below.
SELECT [FIRST | LAST | ( CLOSEST [LEFT | RIGHT | UP | DOWN] '(' <variable> ')' )] BOX | <property> [',' <property>...]
FROM <uri>
WHERE <boolean expression using spatial relations and the BOX variable>
HAVING <boolean expression on property values>
ORDER BY <property> | XPOS '(' UL | LR ')' | YPOS '(' UL | LR ')' [ASC | DESC] [ ',' ... ]
WITH <variable> '=' [<uri>|<coordinates>|<sub-query>][',' <variable> '=' ...]
A query is composed of the following clauses.
The mandatory SELECT clause defines the information to be extracted by the query. According to the spatial relation theory, the language allows selecting specific boxes from the box model associated with the data and returning their graphical representation (e.g., a graphic snapshot), specified through the BOX keyword, and/or a comma-separated list of their attributes. Moreover, the SRQL syntax provides several operators to refine the selection. Among these, the FIRST and LAST operators make the SELECT return only the first or the last extracted box, respectively, with respect to the ordering given by the ORDER BY clause, and the CLOSEST [UP, DOWN, LEFT, RIGHT] operators select only the extracted box that is closest to a given reference box, with respect to one of the four basic spatial directions.
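The refinement operators can be sketched in Python as follows. This is only an illustration under stated assumptions: the source does not say how "closest" is measured, so the sketch assumes it is the gap between the facing edges of the two boxes, and it assumes image coordinates with y growing downward. The `Box` class and the function names `first`, `last`, and `closest` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Box:
    name: str
    x_ul: float
    y_ul: float
    x_lr: float
    y_lr: float


def first(boxes: List[Box]) -> Optional[Box]:
    """FIRST: first box of an already ordered result, if any."""
    return boxes[0] if boxes else None


def last(boxes: List[Box]) -> Optional[Box]:
    """LAST: last box of an already ordered result, if any."""
    return boxes[-1] if boxes else None


def closest(boxes: List[Box], ref: Box, direction: str) -> Optional[Box]:
    """CLOSEST: nearest box lying entirely in `direction` from `ref`.

    Assumption: distance is the gap between the facing edges.
    """
    gaps: Dict[str, Callable[[Box], float]] = {
        "LEFT": lambda b: ref.x_ul - b.x_lr,
        "RIGHT": lambda b: b.x_ul - ref.x_lr,
        "UP": lambda b: ref.y_ul - b.y_lr,
        "DOWN": lambda b: b.y_ul - ref.y_lr,
    }
    gap = gaps[direction]
    # Keep only boxes that really lie on the requested side of `ref`.
    candidates = [b for b in boxes if gap(b) >= 0]
    return min(candidates, key=gap, default=None)


ref = Box("ref", 100, 100, 200, 150)
extracted = [
    Box("a", 220, 100, 260, 150),  # just right of ref
    Box("b", 400, 100, 450, 150),  # further right of ref
    Box("c", 100, 200, 200, 250),  # below ref
]
nearest_right = closest(extracted, ref, "RIGHT")
nearest_down = closest(extracted, ref, "DOWN")
```

Here `nearest_right` is box `a` and `nearest_down` is box `c`; a direction with no candidate box yields `None`.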
The mandatory FROM clause contains the URI identifying the resource to be queried.
The WHERE clause contains a boolean expression that determines whether a box is to be selected by the query. The expression is built by combining spatial relations with logical operators.
Relations can be applied to the keyword BOX (which represents the box being evaluated by the query) and to the box variables defined in the WITH clause. These variables can, in general, represent sets of boxes. SRQL also supports the keywords ANY and ALL to compare a single box with a set of boxes, with the same semantics as in SQL. Finally, a further keyword, EACH, is provided to support more complex extractions, with the following meaning. Given the expression BOX relation EACH B, for each box b in the set B, the EACH construct computes the set of all the boxes a such that a relation b holds. The overall evaluation collects and returns all these boxes a.
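The semantics of ANY, ALL, and EACH can be illustrated with a small Python sketch. It follows one literal reading of the description above: EACH builds one group of boxes per reference box and then collects the groups. The relation `left_of`, the `Box` class, and the helper names are hypothetical and are not part of SRQL.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Box:
    name: str
    x_ul: float
    x_lr: float


Relation = Callable[[Box, Box], bool]


def left_of(a: Box, b: Box) -> bool:
    """Hypothetical spatial relation: a lies entirely to the left of b."""
    return a.x_lr <= b.x_ul


def rel_any(box: Box, relation: Relation, refs: List[Box]) -> bool:
    """BOX relation ANY B: holds for at least one b in B."""
    return any(relation(box, b) for b in refs)


def rel_all(box: Box, relation: Relation, refs: List[Box]) -> bool:
    """BOX relation ALL B: holds for every b in B."""
    return all(relation(box, b) for b in refs)


def rel_each(
    candidates: List[Box], relation: Relation, refs: List[Box]
) -> Tuple[Dict[str, List[Box]], List[Box]]:
    """BOX relation EACH B: one group per b, then all groups collected."""
    groups = {b.name: [a for a in candidates if relation(a, b)] for b in refs}
    collected: List[Box] = []
    for group in groups.values():
        for a in group:
            if a not in collected:  # avoid returning a box twice
                collected.append(a)
    return groups, collected


a1, a2, a3 = Box("a1", 0, 50), Box("a2", 300, 350), Box("a3", 500, 550)
B = [Box("b1", 200, 250), Box("b2", 400, 450)]
groups, collected = rel_each([a1, a2, a3], left_of, B)
```

In this toy layout `a1` is left of both reference boxes, `a2` only of `b2`, and `a3` of neither, so `a2` satisfies the ANY test but not the ALL test.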
The ORDER BY clause defines the ordering of the results. The ordering key can be any box property (see the HAVING clause below) or the coordinates, specified by the keywords XPOS and YPOS, of the upper-left (UL) or lower-right (LR) corner of the box. The ordering can be ascending (ASC) or descending (DESC), and more than one ordering key can be specified.
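A multi-key ordering of this kind can be sketched in Python by applying stable sorts from the least significant key to the most significant one. The helpers `xpos`, `ypos`, and `order_by`, as well as the `Box` class, are hypothetical names introduced for this example only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Box:
    name: str
    x_ul: float
    y_ul: float
    x_lr: float
    y_lr: float


Key = Callable[[Box], float]


def xpos(corner: str) -> Key:
    """XPOS(UL) or XPOS(LR): x coordinate of the chosen corner."""
    return lambda b: b.x_ul if corner == "UL" else b.x_lr


def ypos(corner: str) -> Key:
    """YPOS(UL) or YPOS(LR): y coordinate of the chosen corner."""
    return lambda b: b.y_ul if corner == "UL" else b.y_lr


def order_by(boxes: List[Box], keys: List[Tuple[Key, bool]]) -> List[Box]:
    """Sort by several (key, descending) pairs, most significant first."""
    ordered = list(boxes)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last yields the usual multi-key ordering.
    for key, descending in reversed(keys):
        ordered.sort(key=key, reverse=descending)
    return ordered


boxes = [
    Box("a", 10, 50, 60, 90),
    Box("b", 10, 20, 60, 40),
    Box("c", 5, 80, 40, 120),
]
# Corresponds to: ORDER BY XPOS(UL) ASC, YPOS(UL) DESC
ordered = order_by(boxes, [(xpos("UL"), False), (ypos("UL"), True)])
names = [b.name for b in ordered]
```

With these sample boxes the resulting order is `c`, `a`, `b`: `c` has the smallest x coordinate, and `a` precedes `b` because ties on x are broken by descending y.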
The WITH clause is used to define particular (sets of) boxes on the box model and assign them to variables to be used in the WHERE clause. The tool automatically defines the START variable to point to the upper-left corner of the user image. Other boxes can be specified using their absolute coordinates or as the result of a nested query.
The HAVING clause contains a boolean expression using the common comparison operators applied to property identifiers. If this clause is specified, the expression is evaluated on each box selected by the WHERE clause, and only the boxes satisfying both the WHERE and the HAVING expressions are returned by the query. In particular, when a box has textual content, the LIKE operator can be used to match it against a regular expression.
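The two-stage filtering described above can be sketched in Python as follows. The sketch assumes that LIKE succeeds when the regular expression matches anywhere in the text (Python's `re.search`); whether SRQL instead requires a full match is not stated in the source. The `Box` class, the property names `text` and `color`, and the functions `like` and `select` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Box:
    name: str
    attrs: Dict[str, Any] = field(default_factory=dict)


Predicate = Callable[[Box], bool]


def like(text: str, pattern: str) -> bool:
    """LIKE, assumed here to mean 'the regex matches somewhere in text'."""
    return re.search(pattern, text) is not None


def select(
    boxes: List[Box], where: Predicate, having: Optional[Predicate] = None
) -> List[Box]:
    """Apply WHERE first, then HAVING (if given) to the surviving boxes."""
    selected = [b for b in boxes if where(b)]
    if having is None:
        return selected
    return [b for b in selected if having(b)]


boxes = [
    Box("b1", {"text": "Total: 42", "color": "red"}),
    Box("b2", {"text": "Subtotal: 40", "color": "red"}),
    Box("b3", {"text": "Total: 7", "color": "black"}),
]


def where(b: Box) -> bool:
    # Stand-in for a spatial WHERE expression that accepts every box.
    return True


def having(b: Box) -> bool:
    # Red boxes whose text starts with "Total".
    return b.attrs.get("color") == "red" and like(
        b.attrs.get("text", ""), r"^Total"
    )


result = select(boxes, where, having)
result_names = [b.name for b in result]
```

Only `b1` survives both stages: `b2` fails the anchored pattern and `b3` fails the color comparison.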