GILP: Symbolic Machine Learning Over Knowledge Graphs
Id  Status  Submission Date  Grant Date  Authorship 

US17/546,450  Granted  12 Sep 2021  12 Dec 2023  100% 
Inductive Logic Programming (ILP) is a branch of machine learning (ML) that describe a family of algorithms for generating programs intended to return a given set of expected results.
GILP is an metarules ILP based method specifically optimized for minimizing remote queries on a graph database where information is stored.
Problem Definition
Given a knowledge base (KB), a set of positive samples (+S) and negative samples (S)^{1} , the ILP algorithm will generate a program (P) capable of returning at least +S while excluding S when executed over KB. ILP could be depicted as a function <KB, +S, S> → <KB, P>
and it is ascribable as NP problem, though GILP increase efficiency by pruning out trees of impossible solutions. GILP is based on metarules that are checked via implementation specific query (Q). When a solution is found, the metarule is used within the Q variable instances to generate an implementation specific program P.
Simple Example
Imagine to create a new relation that does not exists in the KB  for example the relation capitalOf
out of the ConceptNet database, by solely combining existing relations and data. Purpose of the GILP task is to find efficiently the simplest pattern that satisfy +S while excluding S. Consider then the predicate capitalOf(X, Y)
:
The resulting program P will be:
capitalOf(X, Y) ← partOf(X, Y), isA(X, capital), isA(Y, country)
paraphrased in “X is the capital of Y when X is a capital and Y is a country and X is part of Y.”
Metarule
Metarules represent the pattern of possible solutions in the search space. They are abstract structures isomorphic of the final program P. Metarules are the generic platformindependent models (PIM) to be transformed into platformspecific models (PSM).
PSM are linked to a specific information system, in the example above the PSM is transformed into Datalog, a simple logic programming language. Symbols for metarules are defined as following:
 R: relation variable. They are instantiated in P.
 X: final variable name.
 Z: temporary variable. Specific for joins.
 C: constant name. its value will be materialized in the final P.
Metasymbols are differentiated by an incrementing suffix.
When considering the capitalOf
example, the metarule defines the structure of the body of such rule. Given the body:
partOf(X, Y), isA(X, capital), isA(Y, country)
the generating metarule was:
R0(X0, X1), R1(X0, C0), R2(X1, C1)
where metasymbols are substituted like in the Table.3.
Metarule Search
Fundamental purpose of GILP is to find the most concise metarule in efficient way. Metarules are checked throughout the execution of Q, which are their PSM counterpart, and they run in the KB ecosystem. The Q metarule checking, is driven by the following macro tasks:
• +G. Find the most unrestrictive (with least amount of computationally expensive joins) rule that matches +S.
• G. Gradually restrict the rule for pruning out S while maintaining +S.
The successful metarule is the first that fulfill all constraints; therefore, the search is iterative process starting from the simplest metarule, to the most admissible complicated one. Q is generated by a given metarule under examination and return an answer corresponding to one of those types:
• +KO: metarule does not match +S.
• KO: metarule matches +S but does not match S.
• OK: metarule successfully checked.
We will consider from now on, KBs with 2arity relations only and we’ll omit, for simplicity, the metasymbol R in the metarule generation.
+G Search
This is the phase intended to find the relation pattern that matches +S. The most obvious metarule to start with is just [X0, X1] which corresponds to nodes in KB already linked with an existing relation R. Then it will try with metarules with progressively greater distance^{2} between the variable pair X0 and X1, like in this sequence:
Table 1  +G phase
The max distance of generated rules is configurable. The generated metarules must include all the possible permutations^{3} of the variable positions in the target KG.
G Search
The G phase is performed over the set of +G metarule candidates that already have match +S (they returned KO in +G phase). G append restrictions over the variables already present in the candidate metarule.
Restrictions are defined as one or multiple relations that link a variable of type X or Z with one or more constants C. The algorithm for generating constraints is the same of +G but rather than generating combinations are permutations of X variables, it uses one X or Z variable within a constant C.
G add more restrictions when none of in the current cycle (in Table.2 column G1) has found an OK result. G search will increase the number of restriction searches (for ex: the G2) when a more restrictive discrimination is required for matching S.
The process continues until a solution is found or all admissible restrictions (configurable) failed. In such case, G is repeated but with the following +G metarule candidate.
Configuration parameters regulate the max number of vars, the max multihop level, and the max total number of restrictions.
Skip Impossible Metarule Branches
Though computational complexity in GILP is categorized in the class of NP, the algorithm efficiently cut down branches of search spaces that can’t qualify as solutions.
Table 3  G phase
PSM Metarule Execution
The metarule selected for inquiry will be used to generate a platform specific query (Q).
Answer Set Programming
ASP is a logic programming paradigm oriented to solve NP problems, and its declarative style is particularly suitable for didactical purposes. If you are not familiar with ASP, you can read some material about it.
Before running Q, +S and S are injected (or made available) when running Q. The pos
and neg
predicates in Q are pointing respectively to +S and S tuples.
Ex: pos(Berlin, Germany), neg(Barcelona, Spain)
.
Let’s take the initial example of finding P for the new relation capitalOf
. A metarule returning OK is [X0, X1; X0, C0; X1, C1]
.
This metarule is used to show the entire process that bring a Q into P.
The corresponding ASP program^{4} for it is:
1 {rel(R0,R1,R2,C0,C1)} 1 : pos(X0,X1), R0(X0,X1), R1(X0,C0), R2(X1,C1)
: pos(X0,X1), rel(R0,R1,R2,C0,C1), not
^{5}(R0(X0,X1), R1(X0,C0), R2(X1,C1))
: rel(R0,R1,R2,C0,C1), R0(X0,X1), R1(X0,C0), R2(X1,C1), neg(X0,X1)
The lines are explained as follows:
 Grounding. New predicate
rel
is generated as solution candidate.rel
is the product of the given metarule variable matching +S. This line will generate as many solutions as the number of combinations of the enclosing variables. A solution is an independent set of predicates. In ASP this phase is called grounding.  +G phase. It prunes out all solutions that do not match all +S.
 G phase. It prunes out all solutions that do match with S.
If all lines (1,2,3) return solutions, they can be used to generate P, and the ASP query Q returns OK result. If after line 2 there are solutions, but there are none after line 3, the result will be KO.
In the case of successful OK, the rel
predicate is materialized as:
rel(partOf, isA, isA, capital, country)
The metarule is then rewritten within the materialized variables from Q, and it is transformed into the body of P:
[X0, X1; X0, C0; X1, C1] => partOf(X, Y), isA(X, capital), isA(Y, country)
Ultimately, P:
capitalOf(X, Y) ← partOf(X, Y), isA(X, capital), isA(Y, country)
SparQL PSM
SparQL is a standard query language for graph databases.
There are 2 types of queries in this PSM:
 Check: it is used whether the given metarule is KO, OK or +KO, without any return of
matched relations or constants.  Bind: it is executed only after OK and returns relations and constants used for building the actual program that solve the positive and negative cases.
In SparQL, nodes are expressed as URI. For convenience we will express all node and relations with the same empty namespace:
.
Check query example:
select (count(*) as ?pos_res) (sum(if(?neg = true, 1, 0)) as ?neg_res){
{
select * {
:Berlin ?R0 :Germany. :Berlin ?R1 ?C0. :Germany ?R2 ?C1.
:London ?R0 :England. :London ?R1 ?C0. :England ?R2 ?C1.
}
}
bind(exists{
filter not exists{
:NY ?R0 :USA. :NY ?R1 ?C0. :USA ?R2 ?C1.
}
filter not exists{
:Barcelona ?R0 :Spain. :Barcelona ?R1 ?C0. :Spain ?R2 ?C1.
}
} as ?neg)
}
If the neg_res
returned variable is greater than zero, then the answer will be OK. Else if the pos_res
is greater than zero, then the answer will be +KO.
Otherwise, the answer will be OK.
In case of OK result, the bind query will be performed for materializing required variables as:
select * {
{
select * {
:Berlin ?R0 :Germany. :Berlin ?R1 ?C1. :Germany ?R2 ?C2.
:London ?R0 :England. :London ?R1 ?C1. :England ?R2 ?C2.
}
}
filter not exists{
:NY ?R0 :USA. :NY ?R1 ?C0. :USA ?R2 ?C1.
}
filter not exists{
:Barcelona ?R0 :Spain. :Barcelona ?R1 ?C0. :Spain ?R2 ?C1.
}
} limit 5
The limit 5
(configurable) indicates that there is a max of 5 programs that might be useful to obtain. The different is just different constants (An). Out of the bind query we can obtain the variable materializations.
Final program P
^{6} will be:
select ?X ?Y {
?X :partOf ?Y. ?X :isA ‘capital’. ?Y :isA ‘country’.
}
Method Applicability
This method finds its natural applicability on graph based databases, but the same principle applies for any kind of database handling with discrete data.
The method assumes that predicates (or records in relational DB) have arity equals two as in the given example. It is easy to adapt this method for different arity, and that adaptation is omitted in the current documentation.
The distinction between metarule selection and Q execution phases brings practical benefits runtime optimization.
Metarule generation is not dataintensive while it requires to be implemented with high level programming languages, it is appropriate to run it outside the DB environment.
Q execution on the other hand, is dataintensive and must be as closest as possible to the data.
State of the Art in ILP
ILP is an old field of research in ML and there are many algorithms. The one proposed here is attributable to the metarules family which already has several implementations (Emdeet al., 1983; De Raedt & Bruynooghe, 1992; Flener, 1996; Kietz & Wrobel, 1992; Wang et al.,2014; Muggleton et al., 2015; Cropper & Muggleton, 2016; Kaminski et al., 2018; Evans &Grefenstette, 2018; Bain & Srinivasan, 2018) in particular Popper ) but all of those methods do not treat the metarule generation, selection and optimization, which is actually a milestone of the method presented here.
Emde et al 1983
De Raedt & Bruynooghe, 1992
Flener, 1996
Kietz & Wrobel, 1992
Wang et al.,2014
Muggleton et al., 2015
Cropper & Muggleton, 2016
Kaminski et al., 2018
Evans &Grefenstette, 2018

Differently from gradientbased ML on which the loss function determines the discrepancy between expected and obtained results in continuous space (real numbers), the E input is required because of the symbolic nature of the process. ↩

Here, distance is defined by the number of joins necessary to link X0 with X1. In a graph, it could be represented as the shortest path between two nodes: X0 and X1. ↩

We consider that not all relations in KB are symmetric therefore the metasymbol position (subject or object in the 2arity predicate) is relevant. ↩

The generation is parametric to the metarule’s variables, and it is omitted here. ↩

The relation
not
(negation as a failure) is reserved keyword in ASP and it is not a materialized predicate in the KB. If the predicate insidenot
is true, then it will fail and viceversa. ↩ 
The SparQL version is isomorphic to the initial Datalog script given as example, the two notations are equivalent. ↩