Hyper-Graph Store Representation and Query Mechanism
| Id | Status | Submission Date | Grant Date | Authorship |
|---|---|---|---|---|
| 18/972,932 | Pending | 07 Dec 2024 | 100% |
Accessing data efficiently for analytics is an tacit requirement for any data service and it is critical for data-intensive applications. Graphs are a particular type of data representation on which single nodes - the minimal unit of manageable information - are connected with others through relations. Relations - or edges - in graphs may connect two nodes, or may connect a multitude of nodes. In the latter case, they are called hyper-edges.
Hyper-graphs (set of hyper-edges) are not a naive data representation. They are more common than the name may suggest. Relational databases are a form of hyper-graphs, for example. If you consider the table name as the label of the edge type, and each row as an edge of the specified edge type, the cell would be the single node1.
Another case that matches with the hyper-graph representation is programs. A simple paradigm in programming that shows the unavoidable similarity is a particular coding representation called Datalog, the format used to describe all data processing tasks hereafter. A Datalog program is an unordered set of predicates identifiable by the predicate type (or name) and its arity (it’s number of arguments). E.g.: the predicate (or fact) locatedIn(munich, germany) is identified as locatedIn/2. There could be an arbitrary number of occurrences of this predicate type with any arity > 0.
Datalog may represent control structures usually present in programming, and also may define data. The account given in this writing is about embedding data and programs structures in memory to favor retrievals in efficient way, while keeping encoding complexity at minimum.
Data may consist of primitive types but also structured types altogether, the only prerequisite necessary is a software facility to identify objects and encode them into integers (hashing) which is not discussed here. The encoding and retrieval mechanisms deal only with integers2. Access to encoded structures is performed by fast hashing or direct memory accesses to guarantee high throughput.
Data Structure
nodes
type: {T => Int}3
This attribute, maps a value passed as input into a integer. This mapping allows to deal only with integers in the encoding and retrieval processes, making the hashing much faster.
posPredNode
type: [{Int => {Int => Int}}]4
This attribute individuates a node by starting from its position (array index) in a specific predicate instance. E.g.: [{0 => {1 => 2}}] encodes the node 2 in the instance 1 of the predicate type 0 in the position 0 (the array’s index).
arities
type: {(Int, Int) => Int}5
Specify the arity of a predicate instance. A predicate type might materialize in instances with different arity. Like: hasProperty/2 or hasProperty/3 are instances of the same type (hasProperty) but different arity.
posNodePreds
type: [{Int => <(Int,Int)>}]6
This attribute leads to identify a set of predicate instances linked to a specific node for a given position.
nodePredPos
type: {Int => {Int => {Int => <Int>}}}
given a node, its return the mapping of predicate instances and the position where the node is located.
There are other attributes used for adding new edges once the graph is already encoded. They are not essential for encoding graphs from scratch, but they assist on extending encoded graphs. These are:
maxNode
type: Int
the max key used in nodes.
maxPredIns
type: {Int => Int}
the max instance key for each predicate type.
Constructor
The graph is encoded with the structure depicted above by providing the a set of edges in with or without an already present graph. For simplicity there in only one single case, the injection of new edges in already present graph7.
Edge $E$ is a tuple with size greater than zero. The first (foremost left) element is the predicate type (PT) while the subsequent others are arguments of that PT.
\(E = (PT, arg_1...arg_n)\)
The input is then a graph $G$ and a set8 of edges, Input = (G, <E>). For each edge in <E> the following algorithm should be performed:
- retrieve/create the predicate type encoding from
nodes. - increment
maxPredInsvalue for the specific predicate type. - set the edge arity in
arities. - for each argument
argat positionposinE:- retrieve/create the
argencoding innodes - set
posPredNodeat positionposthe map withpredicate_type => predicate_instanceas keys andargas value. - set
posNodePredat positionposthe map withargas key and then add the pair(predicate_type,predicate_instance)in the corresponding set. - set
nodePredPoswith keyargand value a nested map withpredicate_type => predicate_instanceandposas value.
- retrieve/create the
Query
There are three ways to retrieve data from $G$:
- get all encoded edges.
- get all edges that contains a given node.
- factoid and confirmation edge query.
The (1) could be meaningful for generic purposes of edge flushing. The (2) and (3) are the basic query functionalities that can be extended by downstream processes for wide variety of retrieval, e.g.: multi-hop query composition (joins), query with negations.
1. decode graph
get all edges back means to decode the graph into its originating edges as submitted in the constructor.
2. node neighbors
The functionality returns edges that contain a given node node, either as predicate type or argument.
- lookup
nodeinnodePredPosand get the related sequencepredPosswith elements of type(pinstance,pos)whereposis the position ofnodeinpinstance. - for each
pinstanceandposinpredPoss:- look up the
arityofpinstanceinarities - for each
pred-posfrom0toarity:- lookup the node
n2inposPredNodein correspondence ofpred-pos/pinstance- enqueue
n2as argument of the returning edge.
- enqueue
- lookup the node
- look up the
3. factoid and confirmation query
Factoid queries are intended to retrieve some specific argument (or set of arguments) by defining placeholders (or variables) such as: hasProperty(sky, X). In this case X is a variable the query engine should materialize in a way that the resulting edges exist in the graph.
Unification9 is applied when 2 or more variables are identical. E.g.: two points trace a vertical line when they lay in the horizontal coordinate are the same like in vertical(point(X,Y1), point(X,Y2)). In this case, X is identical in both points. Unification, guarantees in the example that X values applied to both point predicates are the same. The query process implements unification for this purpose.
Confirmation queries answer with a yes or no to a precise query that does not hold any variable, such as hasProperty(sky, blue) in this case the query engine returns yes.
The query input is an edge $E$ which arguments might be values or variables. The query procedure might be described with:
- A recursive function processes every non-variable element $e_i$ in $E$10 in sequence and filter edges that match with $e_i$ at the same position $i$. The edges that are filtered are the ones already filtered by previous argument $e_{i-1}$.
- The edges filtered by by $E$ arguments are eventually transformed into a matrix with width equals to the arity of $E$, and height as the number of filtered edges.
- The matrix’s columns that refers to variables in the input $E$ are then extracted.
- Unification is applied to correspondent identities among values.
- In case of factoid request the columns variables are returned as solutions to the query.
- In case of confirmation query, if the matrix is empty then the result will be
nootherwiseyes.
Considerations
similarity search
This method is not intended to provide functionalities such as fuzzy/similarity search11 but the graph data structure would not prevent to easily implement them, since the interface with actual given data is kept in a single mapping storage nodes.
shared value mapping among multiple graphs
The nodes graph attribute is dedicated to map the given values with the internal (numeric) encoding. This attribute might be shared - for optimization purposes - by a multitude of graphs while maintaining the uniqueness of {T => Int} mapping.
-
The main difference one might notice is that in a graph, nodes are uniquely identified via an URI, while in relational databases the uniqueness is defined via normalization. ↩
-
Integers leverage fast hashing. The mapping between internal entities are defined by the high performing Fast Mergeable Integer Maps algorithm. ↩
-
The
{}denotes a function that takes an input and return aninteger.Tdenotes any type. It equals theMapdata type present in many programming languages. ↩ -
Squared brackets denotes an indexed sequence, in particular it refers to the
arraydata type. ↩ -
The type enclosed by
()is a tuple, in this case is apairofinteger. ↩ -
The
< >delimiters indicate a set data type of elements with type(Int,Int), apairofinteger. Set type guarantees uniqueness of the collected elements. ↩ -
For brand-new graph without prior edges, the already given graph will be just an empty graph. ↩
-
Every $E$ is unique and therefore exists only 1 occurrence $E$ with equal predicate and arguments’ values. For specifying more occurrences, it must be made explicitly with further provided information. ↩
-
Where $i$ is the related position of $e$ in $E$. ↩
-
In SQL there are several functionalities as
LIKEthat correspond to that definition. ↩