A Knowledge Graph Data Expansion Method Based on Relational Propensity Categories with Legal Applications

By introducing the definition of relation propensity categories, implicit information in the knowledge graph is mined, and the MCBE (Maximum Clique Based Expansion) algorithm is used for data expansion. Experimental results show that with the baseline models TransE, RotatE, HAKE and ComplEx on the FB15K dataset, the MRR and Hits@1 metrics improve by (7.9%, 9.6%), (4.2%, 3.3%), (2.7%, 4.8%) and (1.7%, 2.4%), respectively. Experiments are also conducted on the FB15K, YAGO3-10, NELL-995 and DBpedia50 datasets using TransE as the baseline, where the MRR and Hits@1 metrics improve by (7.9%, 9.6%), (0.3%, 27.7%), (20.1%, 100%) and (4%, 38.7%), respectively. Finally, the MCBE algorithm is applied to a self-constructed knowledge graph of the "anti-drug law", improving MRR and Hits@1 by (10.6%, 12.3%). The experimental results show that the MCBE algorithm improves the prediction accuracy of legal knowledge graphs.


Introduction
With the development of big data and the Internet, huge amounts of data are constantly being generated in all areas of life and industry. The first step in finding a suitable application for this data is to choose a reasonable way of representing it, and knowledge graphs face three main challenges here. (1) Computational efficiency. A knowledge graph is stored like a graph in a data structure: each entity is represented by a node and each relationship between entities by an edge, so a knowledge base usually has tens of thousands of nodes and hundreds of thousands of edges. Learning from a knowledge graph therefore comes down to the complexity and time cost of the algorithms; conventional algorithms often cannot handle large-scale knowledge graphs and are only usable on small and medium-sized ones.
(2) Data sparsity. Data in knowledge graphs tends to be sparse because of a long-tail effect: a small number of nodes have a large number of relationships, while most nodes appear in only one or two triples.
For example, the DBpedia50 dataset has 49,900 entities but only 32,388 triples, so each entity is involved in only 0.65 triples on average, and most entities have only a single triple available for learning. For these entities, information can only be inferred from related nodes or related paths in the knowledge graph, which makes learning extremely difficult, and prediction and completion tasks involving them tend to have extremely low accuracy.
(3) Scalability. In practice, a knowledge base must be constantly updated and improved, but many knowledge graph algorithms do not account for subsequent additions: once learning and training are complete and the model is in use, new entities and relationships often cannot be added, or the cost of re-learning is huge. This places new requirements on the scalability of knowledge graph algorithms.
The current mainstream research approach to knowledge graphs is KRL (Knowledge Representation Learning), i.e., representing triples with low-dimensional embedding vectors, which has largely brought the computational efficiency problem under control.[8] This paper focuses on the data sparsity problem, inspired by methods such as AMIE (Association Rule Mining Under Incomplete Evidence) that introduce external information to enhance knowledge graph learning.[9] Most existing knowledge graph learning models consider only the direct information in the knowledge graph and ignore the large amount of implicit information it contains. By introducing the concepts of "relation category" and "relation propensity category" and combining them with traditional graph algorithms, this paper proposes a knowledge graph data expansion embedding algorithm based on maximal cliques.
The algorithm expands the knowledge graph dataset by mining implicit information in the knowledge graph itself, without introducing external information, allowing the model to learn more fully in subsequent representation learning and thus improving its performance and accuracy for downstream tasks. In addition to general-purpose datasets, the algorithm has been validated on a self-constructed knowledge graph of legal document cases related to the Anti-Drug Law of the People's Republic of China, obtaining improved accuracy on the offence prediction task.
The main contributions of this paper are as follows: 1. The algorithm is dataset-independent: it applies to all knowledge graphs, makes no requirements on the composition of the dataset, provides some benefit in every case, and yields considerable accuracy improvements on certain knowledge graphs, indicating good scalability and adaptability.
2. The algorithm is data-desensitised: the training process does not need the actual entity and relation names, i.e., it can learn from graphs given only their structure.
3. Relative to the learning process, the algorithm is simple and does not increase the time complexity of the model, adding only a few minutes of computing time for small knowledge graph datasets and around ten minutes for large ones.
4. The algorithm combines a classical graph algorithm with knowledge graphs for the first time, showing that traditional algorithms can still play a role in training today's complex models and suggesting directions for further combining graph algorithms with knowledge graphs.
5. Experiments and analyses on the constructed legal knowledge graph show that the algorithm can practically enhance the application of knowledge graphs in law.

Semantic Matching Models Based on Tensor Decomposition
Semantic matching models based on tensor decomposition represent the latent semantics of entities and relations in a vector space. The most basic model of this type is the RESCAL bilinear model, which uses full-rank matrices to represent relations; all interactions between entities and relations are matrix operations, allowing deep interaction of information between them.[10] However, this brings two drawbacks: (1) the RESCAL model overfits easily and has low generalisation ability; (2) RESCAL stores relations in matrix form, so its computational complexity grows with the number of entities and relations, making it difficult to apply to large-scale knowledge graphs.
To address this problem, the DistMult model relaxes RESCAL's constraints on the relation matrix.[11] The relation matrix is simplified into a diagonal matrix, which drastically reduces computational complexity and achieves better results. However, because of this simplification, DistMult can only portray symmetric relations in the knowledge graph and cannot handle other relation types.
The ComplEx model, building on DistMult, introduces the complex domain to knowledge graph embedding for the first time,[12] and is able to portray both symmetric and asymmetric relations.

Translation-based modelling
TransE is the first translation-based model. Translation models are inspired by the observation in Word2Vec that the (King − Man) vector should be approximately equal to the (Queen − Woman) vector. Each triple (head entity, relation, tail entity) in the knowledge graph dataset is regarded as a translation from the head entity to the tail entity via the relation; that is, for each triple, TransE expects head entity + relation ≈ tail entity. This is shown in Fig. 1, where h, r and t denote the head entity, relation and tail entity, respectively.[13] The TransE model is relatively simple, with few parameters and low computational complexity, yet it makes good use of the potentially complex semantic links between entities and relations and performs remarkably well on large-scale sparse graphs; it has become a representative and important model in knowledge graph research. However, TransE's portrayal of relations is still insufficient, and it struggles with the three kinds of complex relations: 1-N, N-1 and N-N. Complex relations are categorised by the number of entities connected at the head and tail of the relation; e.g., 1-N means that each head entity corresponds to N tail entities under that relation.
To overcome TransE's difficulty with complex relations, TransH proposes letting an entity have different representations under different relations.[14] Each entity vector is projected onto a relation-specific hyperplane for each triple, which solves the many-to-many relationship problem in knowledge graphs to some extent.
TransR first proposed the idea that an entity is an aggregate of multiple attributes: an entity can have many attributes, but a particular relation may involve only some of them.[15] TransR holds that each relation has its own semantic space, and for each triple in the knowledge graph it projects the entities into the corresponding relation space before training.
TransD constructs a dynamic mapping matrix for each entity-relation pair by simultaneously considering the diversity of entities and relations.[16] Rule-based methods such as AMIE focus instead on how to assess the validity of mined rules, generally using a confidence measure: the higher a rule's confidence, the more instances of the rule exist in the knowledge graph and the more likely it is to be accurate.
Building on AMIE, RUGE uses an iterative training approach: rules are added to the knowledge graph dataset as soft rules and trained in the traditional way; soft rules are then extracted again from the trained model and training is repeated. Iterating in this way until a suitable threshold is reached also achieves good results.
The above two methods show that adding useful information to the knowledge graph can improve the learning and training of knowledge graph models.
However, there are two drawbacks: (1) wrong information may be added during training, and if the threshold is not well controlled, too much wrong information may instead reduce the model's accuracy; (2) these methods require some human intervention.

Methodology of this paper

Entity categories in the knowledge graph
Knowledge graphs are constructed by natural language processing or manually, and the entities involved mostly come from the same direction or domain. In a film knowledge graph, for example, the entities may be film titles, directors, actors, voice actors, and so on. The titles "Inception" and "Tenet" both belong to films, so they may appear in the knowledge graph as the triples (Inception, director, Nolan) and (Tenet, director, Nolan). These two triples can be generalised to the form (film title, director, Nolan), so their entity category is "film". The same entity can have more than one category attribute: "Xu Zheng" is an actor in the film "I am not the God of Medicine" and a director of the film "Me and My Hometown".

Relational propensity categories in the knowledge graph
The traditional category of a relation is defined by the meaning of the relation word itself. For example, in the triples (Inception, director, Nolan) and (The Silver Age, author, Wang Xiaobo), the relations "director" and "author" are relatively similar under the traditional definition and may belong to the same category, because both take the form of a "work" corresponding to a "creator", i.e., both have the form (work, director or author, creator).
Therefore, the common category of director and author can be summarized as "creative professions".
This paper proposes a new definition, the "relation propensity category": the category of a relation is determined not by the relation itself but by the categories of its head and tail entities, in the form (head propensity, tail propensity). The head propensity represents the relation's tendency toward the category of its head entity, and the tail propensity its tendency toward the category of its tail entity. The two triples above can thus be converted from the form (work, director or author, creator) into the forms (film title, director, director's name) and (book title, author, author's name) respectively.
The propensity category of the relation "director" is therefore (film title, director's name) and that of the relation "author" is (book title, author's name), so "director" and "author" do not share a propensity category. The relations "author" and "translator", by contrast, share the same propensity category (book title, author's name).

Data expansion methods based on relational propensity categories
Based on the concept of "relation propensity category" proposed above, this paper proposes a data expansion method whose core is to determine, from the implicit links contained in the triples of the dataset, whether the propensity categories of relations coincide. If the propensity categories of two relations are essentially the same, the triples involving those relations can substitute for each other's head or tail entities, expanding the dataset. If only the head propensities of two relations match, the head entities of their related triples can be used for mutual expansion; likewise, if only the tail propensities match, the tail entities can be used. For example, if the propensities of the relations "situated" and "located" are essentially the same, the triple (West Lake, situated, Hangzhou) can be expanded into (West Lake, located, Hangzhou), giving the relation "located" more related triples to learn from.
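The substitution step above can be sketched in a few lines. This is a minimal illustration for the fully-equivalent case (both head and tail propensities match); the relation names and triples are illustrative, not taken from a real dataset.

```python
# Sketch of the expansion step, assuming two relations have already been
# judged to share the same propensity category on both sides.

def expand_triples(triples, equivalent_relations):
    """For each pair of propensity-equivalent relations, copy every triple
    of one relation over to the other."""
    expanded = set(triples)
    for r1, r2 in equivalent_relations:
        for h, r, t in triples:
            if r == r1:
                expanded.add((h, r2, t))
            elif r == r2:
                expanded.add((h, r1, t))
    return expanded

triples = [("West Lake", "situated", "Hangzhou"),
           ("Slender West Lake", "located", "Yangzhou")]
out = expand_triples(triples, [("situated", "located")])
# Both relations now carry both facts.
assert ("West Lake", "located", "Hangzhou") in out
assert ("Slender West Lake", "situated", "Yangzhou") in out
```

When only the head (or only the tail) propensities match, the same idea applies restricted to the matching side.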
The method has several advantages: 1. It can be applied to any knowledge graph dataset, with no requirement on the number of triples, entities or relations.
2. The method is data-desensitised and can be applied to knowledge graph datasets in which entities and relations are represented only by codes rather than actual names.
3. The method has low time complexity, approximately O(n), where n is the number of triples in the knowledge graph dataset.
The difficulty of the method is finding a suitable way to identify relations with consistent propensity categories in a knowledge graph while avoiding noisy data.
Common knowledge graph datasets typically have hundreds of thousands of triples but only a few hundred to a few thousand relations, so each relation may involve hundreds or even thousands of triples. This makes ordinary methods such as clustering hard to apply for judging whether relation propensity categories coincide, and prone to introducing noisy data.
This paper therefore turns to graph algorithms, introducing the maximal clique algorithm to find the propensity categories of relations, and proposes a maximal-clique-based embedding algorithm for knowledge graph data expansion that ensures the data added to the knowledge graph is reliable.

Maximum Clique Algorithm
The maximum clique problem is a classical combinatorial optimisation problem in graph theory, defined as follows. Given an undirected graph G = (V, E), where V is a non-empty set called the vertex set and E, the edge set, is a set of unordered pairs of elements of V, a graph U is a complete subgraph of G if U ⊆ V and (u, v) ∈ E for any two vertices u, v ∈ U. For any undirected graph G = (V, E), its complement G' = (V, E') is defined by (u, v) ∈ E' if and only if (u, v) ∉ E. If U is a complete subgraph of G, then U is an independent set of G', and vice versa; thus the cliques of G correspond one-to-one with the independent sets of G'. In particular, U is a maximal clique of G if and only if U is a maximal independent set of G'.
The goal of the maximal clique algorithm is to find a subgraph in which the vertices are completely connected and to which no further vertex of the graph can be added while preserving this property. Nodes 1, 2 and 5 in Figure 2 below form one maximal clique. Nodes in the same maximal clique all have very similar characteristics; applied to the entity categories of a knowledge graph, they can be considered to belong to similar categories.
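A minimal sketch of maximal clique enumeration via the classic Bron-Kerbosch recursion. The adjacency list below is an assumed reconstruction of the example graph of Figure 2, chosen so that nodes 1, 2 and 5 form a maximal clique as the text states.

```python
# Basic Bron-Kerbosch (no pivoting): R = current clique, P = candidates,
# X = already-processed vertices. A maximal clique is reported when both
# P and X are empty.

def bron_kerbosch(R, P, X, adj, out):
    if not P and not X:
        out.append(R)
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P.remove(v)
        X.add(v)

# Assumed edges: 1-2, 1-5, 2-3, 2-5, 3-4, 4-5, 4-6.
adj = {1: {2, 5}, 2: {1, 3, 5}, 3: {2, 4},
       4: {3, 5, 6}, 5: {1, 2, 4}, 6: {4}}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
assert {1, 2, 5} in cliques   # the maximal clique from Figure 2
```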

Construction of a Knowledge Graph Based on the Anti-Drug Law of the People's Republic of China
In this paper, based on 500 case documents in categories related to the Anti-Drug Law, a legal knowledge graph of the "anti-drug law" is constructed by extracting textual information with natural language processing combined with manual screening.
The main entity types in this knowledge graph are key information such as "person name", "drug type" and "grams". Four relation types are defined: "drug trafficking", "providing shelter", "drug possession" and "drug delivery".

Knowledge Graph Data Expansion Algorithm Based on Relational Propensity Categories
With the maximal clique algorithm and the concept of "relation propensity category" proposed above, the algorithm in this paper can find the maximal cliques of relations sharing the same propensity category in the knowledge graph dataset, so that the triples of relations belonging to one clique complement each other, improving the final model's learning of each relation and entity and thus its accuracy. The algorithm first analyses the triples in the training set to obtain the entity list of each relation, then compares and filters them with a similarity calculation and a threshold S, computing head-entity and tail-entity similarity tables for the relations. From the head-entity similarity table, each relation is treated as a "node" and relations with similar head entities as connected by "edges", yielding a head-propensity similarity graph; the same is done for the tail-entity similarity table. The maximal clique algorithm is then applied to the head- and tail-propensity similarity graphs to obtain groups of relations with consistent head propensity or consistent tail propensity. The data involved in the relations of each maximal clique can then complement each other.

In this paper, the above maximum-clique-based data expansion algorithm is called the MCBE (Maximum Clique Based Expansion) method. This chapter explains the detailed steps of the algorithm on the FB15K dataset, mainly using the head entity as an example.
The overall flow of the algorithm is divided into the following steps: 1. Count the head propensity and tail propensity of each relation based on the triples in the training set.
2. Compute pairwise similarity between the head (and tail) propensities of relations; the more similar two relations are, the more related they are considered. This yields two complete relation correlation tables.
3. Using the head and tail relation tables, treat each relation as a "vertex" and connect two related relations with an "edge", forming a new relation graph on which the maximum clique algorithm finds the cliques (the cliques found are the communities among the relations). The reason for not using propensity-category similarity directly as the expansion criterion is as follows: suppose that in a dataset relation A is similar in propensity category to relation B, B to C, and C to D. Using similarity alone as the criterion would treat A, B, C and D as equivalent during expansion, even though the respects in which each pair is similar differ. For example, "watermelon" and "passion fruit" both belong to the category "fruit", and "watermelon" and "chocolate" both belong to the category "sweet food", but "passion fruit" and "chocolate" are not equivalent. This naive approach would cause the model to converge all relation embeddings when learning entity and relation embeddings, resulting in a very poor model.
The nature of maximal cliques ensures that nodes in the same maximal clique have almost identical properties, and finding the maximal cliques in the relation association graph ensures that all relations in one clique have highly similar head or tail propensities. In other words, judging by similarity alone can only suggest that the relations involved have some connection, while the maximal clique criterion ensures that all relations in the same clique are near-synonyms or even synonyms, avoiding the addition of noisy data.
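The transitivity argument above can be demonstrated in a few lines: a pairwise-similarity chain A-B-C-D would lump all four relations together, while a clique requires every pair to be directly similar. The edges here are illustrative.

```python
# Pairwise similarity forms a chain A-B-C-D, but a clique keeps only groups
# whose members are ALL mutually similar, so A and D are never merged.
import itertools

edges = {("A", "B"), ("B", "C"), ("C", "D")}

def is_clique(group):
    return all((u, v) in edges or (v, u) in edges
               for u, v in itertools.combinations(group, 2))

# Chain-style (connected component) grouping would equate all four relations:
assert not is_clique({"A", "B", "C", "D"})   # A and D are never directly similar
assert is_clique({"A", "B"}) and is_clique({"C", "D"})
```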

Calculation of relation propensity
Firstly

Similarity calculation to obtain relationship association table
Using the relation head- and tail-propensity data obtained in the previous step (Table 2), the propensity similarity of the head (or tail) entities involved in each pair of relations is computed, using cosine similarity; pairs below a threshold S are regarded as having no propensity similarity. The threshold S is usually inversely related to the number of relations R in the dataset; experiments show that 0.85 is appropriate in most cases, while 0.9 is better when the number of relations is small (e.g. fewer than 100). Taking head-entity propensity as an example, the specific procedure is given in Algorithm 2. With Algorithm 2, the relation association table for the FB15K dataset can be obtained as shown in Table 3 below. According to Table 3, each relation id is treated as a "vertex", and an "edge" is added between two similar relations, forming a new relation graph.

Maximum clique algorithm to find relationship communities
On the relation association graph, the maximal clique algorithm described in Section 3.5 is applied, implemented with the Bron-Kerbosch algorithm, which is generally considered more efficient than other similar algorithms, to find the maximal cliques among relations in this new relation association graph. The maximal cliques found are shown in Table 4 below. The same procedure is applied to the tail propensities of the relations to obtain the maximal clique table for tail-propensity expansion.
For relations whose head and tail propensities are both identical, this paper considers them fully equivalent relations; the dataset can be expanded directly by substituting the relations of the triples involved.

Results
In this chapter, FB15K, YAGO3-10, NELL-995 and DBpedia50, widely used standard knowledge graph datasets, are selected as the experimental data sources, and four representative mainstream knowledge graph embedding models, TransE, RotatE, HAKE and ComplEx, are combined with the MCBE expansion method proposed in this paper for experiments. The chapter mainly uses the metric improvements in the link prediction experiments to verify the effectiveness of the algorithm. Finally, it demonstrates the effectiveness of the algorithm in the legal domain through experiments and analyses on the constructed "anti-drug law" knowledge graph.
This chapter covers dataset and benchmark model selection, link prediction, evaluation metrics, experimental setup and result analysis, and the "anti-drug law" knowledge graph experiment and analysis.

Introduction to the dataset
The datasets used in this chapter are FB15K, YAGO3-10, NELL-995 and DBpedia50. They are chosen for the following reasons: (1) they are mainstream datasets commonly used to evaluate knowledge graph algorithms; (2) they all contain large numbers of entities and relations that are difficult to learn, and they cover a variety of relation types, including symmetric, asymmetric and inverse relations, making them representative.
FB15K was introduced in a 2013 paper. YAGO3 is a huge semantic knowledge base derived from Wikipedia, WordNet and GeoNames; it currently holds knowledge about more than 10 million entities (e.g., individuals, organisations, cities) and contains more than 120 million facts about these entities. This volume of data makes YAGO3 very difficult to portray with simple models. YAGO3-10 is a subset of the YAGO3 dataset.
NELL-995 comes from the results of 995 iterations of NELL (Never-Ending Language Learning system), a semantic machine learning system developed by a team of researchers at Carnegie Mellon University.
The DBpedia knowledge base contains descriptions of 5.9 million things, covering domains from geography to demographics. DBpedia50 is a subset of the DBpedia knowledge base.
The details of each dataset are given in Table 5 below, including the number of entities and relations involved and the number of triples in the training, validation and test sets:

Benchmark model selection
In this chapter, three representative knowledge graph embedding models, TransE, RotatE and ComplEx, are used to test the results of the data expansion algorithm. The three benchmark models are described below. TransE minimises the margin loss L = Σ_{(h,r,t)∈S} Σ_{(h',r,t')∈S'} [γ + d(h + r, t) − d(h' + r, t')]_+, where [x]_+ denotes the positive part of x, γ is a margin parameter greater than zero, (h', r, t') are negative sample triples generated by randomly replacing the head or tail entity of positive triples (the generated negatives must be absent from the original dataset), and d is the similarity measure function.
TransE is a relatively basic model, mainly able to represent 1-to-1 relations, but its simple design makes it a good baseline for comparison.
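The TransE translation idea and margin loss above can be sketched numerically. The embedding vectors here are toy values, not trained embeddings.

```python
# h + r should be close to t; the margin loss pushes a corrupted triple
# at least gamma further away than the true one.
import numpy as np

def distance(h, r, t):
    return np.linalg.norm(h + r - t)          # d(h + r, t), L2 norm

def margin_loss(pos, neg, gamma=1.0):
    h, r, t = pos
    h2, r2, t2 = neg
    return max(0.0, gamma + distance(h, r, t) - distance(h2, r2, t2))

h = np.array([0.1, 0.2]); r = np.array([0.3, 0.1]); t = np.array([0.4, 0.3])
t_neg = np.array([2.0, -1.0])                 # corrupted tail entity
loss = margin_loss((h, r, t), (h, r, t_neg))
assert distance(h, r, t) < distance(h, r, t_neg)
```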

The rotation-based model RotatE
RotatE views relations as rotation vectors between entities, scoring a triple by the distance between the rotated head entity and the tail entity: d_r(h, t) = ‖h ∘ r − t‖, where ∘ denotes the element-wise product in the complex plane. RotatE also uses self-adversarial negative sampling: equation (2) gives the sampling probability p(h'_j, r, t'_j) = exp(α f_r(h'_j, t'_j)) / Σ_i exp(α f_r(h'_i, t'_i)), where α is the set sampling rate, (h_i, r, t_i) is the i-th positive triple and (h'_i, r, t'_i) is the i-th generated negative triple. This probability-based sampling makes negative sampling more balanced and efficient, and is better suited to model training than random sampling.
The negative sampling loss function is optimised so that the distance scores of the negative samples exceed those of the positive samples by as large a margin as possible.
Hierarchical category information is also added to make the relationship embeddings with the same hierarchical category as similar as possible.
The loss function is given below in equation (4): L = −log σ(γ − d_r(h, t)) − Σ_i p(h'_i, r, t'_i) log σ(d_r(h'_i, t'_i) − γ), where γ is the boundary value and σ is the sigmoid activation function.
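The rotation view of relations can be illustrated with complex vectors: a relation with unit-modulus entries acts as a pure rotation of the head embedding. The embeddings below are toy values, not trained ones.

```python
# RotatE scoring sketch: d_r(h, t) = || h * r - t ||, with |r_i| = 1
# so that multiplying by r rotates each coordinate in the complex plane.
import numpy as np

def rotate_score(h, r, t):
    return np.linalg.norm(h * r - t)

theta = np.array([np.pi / 2, np.pi / 4])      # rotation angles per dimension
r = np.exp(1j * theta)                        # unit-modulus entries
h = np.array([1 + 0j, 1 + 1j])
t = h * r                                     # tail = exactly rotated head
assert np.isclose(rotate_score(h, r, t), 0.0)  # perfect triple scores ~0
assert np.allclose(np.abs(r), 1.0)             # r really is a rotation
```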

The tensor decomposition based model Complex
The ComplEx scoring function is defined in equation (5): f_r(h, t) = Re(Σ_i h_i r_i t̄_i), where h, r and t are represented by complex vectors, t̄ denotes the complex conjugate of t, and Re(·) denotes the real part of the resulting complex number. ComplEx can handle both symmetric and asymmetric relations simultaneously, and was the first model to introduce the complex-number approach in knowledge graph embedding.
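The scoring function in equation (5) is short enough to write out directly. The complex vectors below are toy values, not trained embeddings.

```python
# ComplEx scoring sketch: Re(sum_i h_i * r_i * conj(t_i)).
import numpy as np

def complex_score(h, r, t):
    return np.real(np.sum(h * r * np.conj(t)))

h = np.array([1 + 1j, 0 + 1j])
r = np.array([1 + 0j, 1 + 0j])        # identity-like relation for illustration
t = np.array([1 + 1j, 0 + 1j])        # tail matching the head
t_other = np.array([-1 - 1j, 0 - 1j]) # a mismatched tail
assert complex_score(h, r, t) > complex_score(h, r, t_other)
```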

Link prediction
Link prediction is a common downstream task in knowledge graph representation learning and is used here to evaluate the models; MCBE is a general knowledge graph data expansion embedding algorithm independent of both dataset and model.
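Since all the experiments below are reported in MRR and Hits@k, here is a small sketch of how these link-prediction metrics are typically computed from the rank of the correct entity among all candidates. The ranks are illustrative.

```python
# MRR: mean of reciprocal ranks; Hits@k: fraction of queries where the
# correct entity is ranked within the top k.

def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 2, 10, 1]           # rank of the true entity per query
assert mrr(ranks) == (1 + 1/3 + 1/2 + 1/10 + 1) / 5
assert hits_at_k(ranks, 1) == 2 / 5
assert hits_at_k(ranks, 10) == 1.0
```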

Multi-model experiments based on the FB15K dataset
Under the FB15K dataset, the four different models TransE, RotatE, HAKE and ComplEx all achieve better results on the dataset expanded by this algorithm than on the original. This is shown in Table 6 below, where rows beginning with "Clique-" are the results on the expanded dataset. The experimental results show that the algorithm is model-independent: all four models on the same dataset gain some improvement, most obviously in the MRR and Hits@1 metrics.

Experiments on expansion of different datasets under TransE model
On the other hand, using the TransE model as a benchmark, the algorithm was evaluated on the FB15K, YAGO3-10, NELL-995 and DBpedia50 datasets respectively, showing that it achieves gains on a variety of datasets, as shown in Table 7 below. On NELL-995 in particular, many large maximal cliques can be found, so the Hits@1 result is roughly doubled.
DBpedia50, as a more complex dataset, is harder to learn, and the algorithm yields only a small improvement there, indicating that the expansion is less effective on datasets whose structure reflects the head/tail-tendency characteristics less strongly; nevertheless, some improvement remains.
Overall, these results indicate that the algorithm is fairly general and provides some improvement on most datasets.

Experiments based on the "anti-drug law" knowledge graph
Offence prediction is a practical downstream application of the link prediction task on the legal knowledge graph, which refers to predicting the offence for which a defendant will be sentenced based on the description of the case and the factual part of the legal document.
The benchmark model used in this subsection is TransE, again with MR, MRR, Hits@1, Hits@3 and Hits@10 as the evaluation metrics; all other experimental parameter settings are the same as before, and rows beginning with Clique- are the results of the model on the dataset expanded by this algorithm. As the results in Table 8 show, after the expansion proposed in this paper, the TransE model improves on the "anti-drug law" knowledge graph in MR, MRR, Hits@1, Hits@3 and Hits@10, with the improvement in MRR and Hits@10 being especially clear. Analysis of the expanded "anti-drug law" knowledge graph shows that the improvement in Hits@10 arises because the expansion algorithm lets similar cases supplement each other: for example, for the relationship between "grams of drugs" and "judgement result", many judgement documents become available as mutual references, which produces a more reasonable prediction ranking and hence a significant increase in Hits@10.
Beyond the "judgement result" task, the results of the model can also be used to improve the "intelligent legal question answering" task.

Experimental setup
The system and environment configuration of the machine used for the experiments is attached at the end of this chapter for reference. The algorithm is innovative in the following ways: 1. The algorithm is dataset-independent, applicable to all knowledge graphs, places no requirements on the composition of the knowledge graph dataset, and has good scalability and adaptability.
2. The algorithm is data-desensitised: the training and learning process does not need to know the actual entity and relation names, i.e. it can learn from graphs for which only the graph structure is available. The ideas raised in this paper still leave much room for research, for example how the analysis of the global characteristics of a knowledge graph can be better combined with local information, and how a single model can be adapted to the characteristics of different knowledge graphs, rather than designing one model and trying to apply it unchanged to many graphs. The work also opens new questions about applying graph algorithms to knowledge graphs: how other graph algorithms can be combined with and improved on knowledge graphs remains to be explored.
Questions that could be pursued in future work are:
1. Improve the data expansion method to accommodate more knowledge graph datasets; currently it is most effective for knowledge graphs with a small number of relations.
2. Consider how other graph algorithms can be combined with knowledge graph learning algorithms.
3. Better integrate the content of laws and judgement documents, and apply the algorithms of this paper to broader areas of law beyond those related to the "anti-drug law".

4. Relationships in the same group are considered to have the same head or tail tendency; further, relationships with the same head and tail tendencies are considered equivalent.
5. Complete the triples in the original dataset based on the data involved in the relationships in each maximal clique.

2.1 Translation-based model TransE
Given a training set S consisting of triples (h, r, t), where h and t belong to the entity set E and r belongs to the relation set R, the size of the embedding dimension is set according to the parameters of the model. TransE is characterised by the requirement that h + r ≈ t, and learns the embeddings by minimising a margin-based ranking loss over the training set with the L1 or L2 norm as the distance measure. Its loss function is given in equation (1) below:

$$L = \sum_{(h,r,t) \in S} \; \sum_{(h',r,t') \in S'_{(h,r,t)}} \left[ \gamma + d(h + r, t) - d(h' + r, t') \right]_{+} \qquad (1)$$

where $\gamma$ is the margin, $d$ is the L1 or L2 distance, $S'_{(h,r,t)}$ is the set of corrupted triples, and $[x]_{+} = \max(0, x)$.
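A minimal NumPy sketch of this margin-based ranking objective for a single positive/corrupted triple pair; the margin value and embeddings are illustrative assumptions.

```python
import numpy as np

def transe_loss(h, r, t, h_neg, t_neg, gamma=1.0, norm=2):
    """Margin ranking loss of equation (1) for one positive triple (h, r, t)
    and one corrupted triple (h_neg, r, t_neg):
    hinge on gamma + d(h + r, t) - d(h_neg + r, t_neg)."""
    d_pos = np.linalg.norm(h + r - t, ord=norm)
    d_neg = np.linalg.norm(h_neg + r - t_neg, ord=norm)
    return max(0.0, gamma + d_pos - d_neg)
```

When h + r exactly equals t and the corrupted triple lies far away, the hinge is inactive and the loss is zero.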

Negative samples, consisting of triples not present in the knowledge graph, are obtained through self-adversarial negative sampling. Self-adversarial negative sampling determines the generation probability of negative samples through a probability generation formula instead of uniform sampling. It is implemented by taking a positive triple and probabilistically replacing its head or tail with other entities to generate a new triple (i.e. a negative sample); this negative sample must not coincide with an existing positive triple in the knowledge graph. The probability generation formula is as follows (3):

$$p\!\left(h'_j, r, t'_j \mid \{(h_i, r_i, t_i)\}\right) = \frac{\exp\!\left(-\alpha \, d_r(h'_j, t'_j)\right)}{\sum_i \exp\!\left(-\alpha \, d_r(h'_i, t'_i)\right)} \qquad (3)$$
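Equation (3) is a softmax over the negated distance scores: "hard" negatives (small distance, i.e. currently plausible to the model) are sampled more often than under uniform sampling. A minimal sketch:

```python
import numpy as np

def sampling_probs(neg_distances, alpha=1.0):
    """Self-adversarial sampling probabilities, equation (3): softmax over
    the negated distances with temperature alpha, so harder negatives
    (smaller distance) receive higher sampling probability."""
    logits = -alpha * np.asarray(neg_distances, dtype=float)
    exp = np.exp(logits - logits.max())   # subtract max for numerical stability
    return exp / exp.sum()
```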

This paper first describes the research motivation of the algorithm, focusing on the data sparsity problem in current knowledge graph learning. Combining the idea of introducing external information, it puts forward the definitions of "relationship propensity" and "relationship propensity category", and based on the latter proposes a maximal-clique-based knowledge graph data expansion embedding algorithm, described in detail. The experimental part introduces each dataset and evaluation metric; experiments are designed and analysed both for different benchmark models on the same dataset and for the same benchmark model on different datasets, demonstrating the validity and generality of the algorithm. Finally, experiments on a legal knowledge graph demonstrate the effectiveness of the algorithm in the field of law.

3. The present algorithm is relatively simple and does not increase the time complexity of the model.

SHS Web of Conferences 181, 03022 (2024) https://doi.org/10.1051/shsconf/202418103022 ICDEBA2023
4. The algorithm combines a graph algorithm with the knowledge graph for the first time, which provides ideas for the subsequent combination of graph algorithms and knowledge graphs. 5. Experiments and analyses were carried out on the constructed legal knowledge graph, which proved that the algorithm in this paper can practically enhance the application of knowledge graphs in law.

2.2 Data-enhanced learning models for knowledge graph representation
A complete subgraph U of a graph G is a maximal clique if and only if U is not contained in a larger complete subgraph of G.

Table 1. Data set format description
First, the data in the training set are read in, in the format described in Table 1.

Table 2 .
Statistics on header entities involved in relationships Table 2 represents 1200 The head entities involved in this relationship are numbered 3123, 1034, 58, and 5733.The specific algorithm is as follows Algorithm 1:
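Algorithm 1 itself is not reproduced in this excerpt; as a sketch of the statistics step it describes, one can collect for each relation the set of head entities that appear with it in the training set (the triple format is assumed to be (head, relation, tail)):

```python
from collections import defaultdict

def head_entity_table(triples):
    """Build the head-entity statistics of Table 2: map each relation id
    to the set of head-entity ids that occur with it in the training set."""
    table = defaultdict(set)
    for head, relation, tail in triples:
        table[relation].add(head)
    return dict(table)
```

For relation 1200 with the head entities listed above, the table entry is exactly {3123, 1034, 58, 5733}.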

Table 3 .
Relational association table for the FB15K dataset (based on head entities)

Table 4 .
The maximum clique table obtained from the FB15K dataset (based on head entities)

Data set expansion based on the maximum clique table
Based on the maximum clique table of relations, the training set is expanded. For example, given the triples (entity 381, relation 8, entity 789) and (entity 482, relation 77, entity 67), because relations 8 and 77 belong to the same clique in terms of head tendency, the triples involved in them can complement each other to obtain (entity 482, relation 8, entity 789) as well as (entity 381, relation 77, entity 67). This increases the number of involved triples and enables the model to better learn the characteristics of these relations.
FB15K, proposed together with TransE by Bordes et al., is a subset of Freebase containing about 14,951 entities and 1,345 different relations. Freebase is a large open-source knowledge graph dataset in which knowledge is filled in by users.
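The expansion step described above can be sketched as follows: within each clique of head-equivalent relations, every head entity observed with one relation may also head the (relation, tail) pairs of the other relations. This is an illustrative reading of the completion rule, not the authors' exact implementation.

```python
def expand_by_head_cliques(triples, cliques):
    """Expand a training set using cliques of head-equivalent relations:
    any head entity observed with a relation in a clique may also head the
    (relation, tail) pairs of the other relations in that clique."""
    heads = {}
    for h, r, t in triples:
        heads.setdefault(r, set()).add(h)
    expanded = set(triples)
    for clique in cliques:
        # all head entities seen with any relation of this clique
        clique_heads = set().union(*(heads.get(r, set()) for r in clique))
        for h, r, t in triples:
            if r in clique:
                expanded |= {(h2, r, t) for h2 in clique_heads}
    return expanded
```

Applied to the paper's example, the two original triples for relations 8 and 77 yield the two complementary triples (482, 8, 789) and (381, 77, 67).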

Table 5 .
Data set statistics

Table 6 .
Results of expansion algorithm for each model on FB15K dataset

Table 7 .
Results of expansion algorithm for TransE model on different datasets

Table 8 .
Results of the TransE model on the "anti-drug law" knowledge graph