USE CASE

Title: Extending KBpedia With Wikipedia Categories
Short Description: This use case describes how knowledge graphs such as KBpedia, which need to be kept current and extended with new knowledge and new mappings, can be maintained with acceptable effort and accuracy.
Problem: Knowledge graphs are under constant change and need to be extended with specific domain information for particular domain purposes. The combinatorial aspects of adding new external schemas or concepts to an existing store of concepts can be extensive. Effective means must be found to enhance or update these knowledge graphs at acceptable time and cost.
Cognonto Approach: Cognonto's KBpedia knowledge graph is extended under this use case by adding more concepts from the Wikipedia category structure, "cleaned" to produce its most natural classes. These extensions are made using an SVM classifier trained over graph-based embedding vectors generated using the DeepWalk method. The source graph is based on the KBpedia knowledge graph structure linked to the Wikipedia categories. Means are put in place to test and optimize the parameters used in the machine learning methods. These mapping techniques are then visualized using the TensorFlow Projector web application to help build confidence that the mapping clusters are correct. The overall process is captured by a repeatable pipeline with statistical reporting, enabling rapid refinements in parameters and methods to achieve the best-performing model. Once appropriate candidate categories are generated using this optimized model, the results are then inspected by a human to make the final selection decisions. The semi-automatic methods in this use case can be applied to extending KBpedia with any external schema, ontology or vocabulary.
Key Findings
  • General methods are explored and documented for how to extend the KBpedia knowledge graph
  • A variety of machine learning methods can reduce the effort required to add new concepts by 95% or more
  • A workable and reusable pipeline leads to fast methods for testing and optimizing parameters used in the machine learning methods
  • Care should be taken when using visualizations to validate relationships, especially when using dimension reduction techniques
  • To our knowledge, this use case is a unique combination of relatively new artificial intelligence methods
  • The approach documented in this use case is applicable to extending a knowledge graph with any external schema, ontology or vocabulary.

In other use cases we have covered multiple ways to use KBpedia to create training corpuses, both for unsupervised learning and for positive and negative training sets for supervised learning 1, 2. Different structures inherent to a knowledge graph like KBpedia can lead to quite different corpuses and sets depending on what structures are used to slice-and-dice the knowledge space. These different ways to create corpuses or training sets yield different predictive powers depending on the task at hand.

The other noted use cases have covered two ways to leverage the KBpedia Knowledge Graph to create positive and negative training corpuses automatically:

  1. Using the links that exist between each KBpedia reference concept and its related Wikipedia page(s)
  2. Using the linkages between KBpedia reference concepts and external vocabularies to create training corpuses out of named entities.

We now demonstrate a third way to create a different kind of training corpus:

  1. Using the KBpedia aspects linkages.

Aspects are aggregations of entities grouped according to shared characteristics distinct from their direct types. Aspects help to group related entities by situation, not by identity or definition. They are another way to organize, and to leverage, the knowledge graph. KBpedia has about 80 aspects that provide this secondary means for placing entities into related real-world contexts. Not every aspect relates to a given entity.
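To make this distinction concrete, here is a minimal sketch in Python (with made-up entity names and aspect assignments, purely for illustration) of how aspects cut across direct types:

```python
# Hypothetical illustration: direct types identify what an entity *is*,
# while aspects group entities by the situations they relate to.
entities = {
    "Abbey_Road":        {"type": "Album",     "aspects": ["Music"]},
    "Bohemian_Rhapsody": {"type": "Song",      "aspects": ["Music", "Genres"]},
    "London_Symphony":   {"type": "Orchestra", "aspects": ["Music"]},
}

def entities_with_aspect(aspect):
    """Return all entities that carry the given aspect."""
    return sorted(e for e, v in entities.items() if aspect in v["aspects"])

# Entities of three different direct types land in the same Music aspect.
print(entities_with_aspect("Music"))
```

Three entities with three different direct types are grouped together by the Music aspect, which is exactly the secondary organization we exploit below.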

As in previous use cases, we continue with the musical domain. Two aspects of interest exist among the 80 available:

  1. Music
  2. Genres

We will first query the KBpedia Knowledge Graph using the SPARQL query language to get the list of all KBpedia reference concepts related to the Music or Genre aspects. Then, for each of these reference concepts, we will count the number of named entities that can be reached in the complete KBpedia structure.

prefix kko: <http://kbpedia.org/ontologies/kko#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix dcterms: <http://purl.org/dc/terms/> 
prefix schema: <http://schema.org/>

select distinct ?class count(distinct ?entity) as ?nb
from <http://dbpedia.org>
from <http://www.uspto.gov>
from <http://wikidata.org>
from <http://kbpedia.org/1.10/>
where
{
  ?entity dcterms:subject ?category .

  graph <http://kbpedia.org/1.10/>
  {
    {?category <http://kbpedia.org/ontologies/kko#hasMusicAspect> ?class .}
    union
    {?category <http://kbpedia.org/ontologies/kko#hasGenre> ?class .}
  }
}
order by desc(?nb)

reference concept                                              nb
http://kbpedia.org/kko/rc/Album-CW 128772
http://kbpedia.org/kko/rc/Song-CW 74886
http://kbpedia.org/kko/rc/Music 51006
http://kbpedia.org/kko/rc/Single 50661
http://kbpedia.org/kko/rc/RecordCompany 5695
http://kbpedia.org/kko/rc/MusicalComposition 5272
http://kbpedia.org/kko/rc/MovieSoundtrack 2919
http://kbpedia.org/kko/rc/Lyric-WordsToSong 2374
http://kbpedia.org/kko/rc/Band-MusicGroup 2185
http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup 2078
http://kbpedia.org/kko/rc/Ensemble 1438
http://kbpedia.org/kko/rc/Orchestra 1380
http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup 1335
http://kbpedia.org/kko/rc/Choir 754
http://kbpedia.org/kko/rc/Concerto 424
http://kbpedia.org/kko/rc/Symphony 299
http://kbpedia.org/kko/rc/Singing 154

Seventeen KBpedia reference concepts are related to the two aspects we want to focus on. The next step is to take these 17 reference concepts and create a new domain corpus with them. We will use version 1.10 of KBpedia to create the full set of reference concepts that will scope our domain by inference.
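The inference step essentially collects every reference concept sitting under the seeds in the subclass hierarchy. A minimal sketch of that expansion, in Python over a made-up fragment of the hierarchy (the actual expansion is performed by the OWL reasoner in the Clojure code that follows):

```python
from collections import deque

# Hypothetical fragment of a subclass hierarchy: child -> parent.
subclass_of = {
    "JazzMusic": "Music",
    "FusionMusic": "JazzMusic",
    "Concerto": "MusicalComposition",
}

def descendants(seeds, subclass_of):
    """Transitively collect every concept whose ancestor chain reaches a seed."""
    # Invert to parent -> children for breadth-first expansion.
    children = {}
    for child, parent in subclass_of.items():
        children.setdefault(parent, []).append(child)
    found, queue = set(seeds), deque(seeds)
    while queue:
        concept = queue.popleft()
        for c in children.get(concept, []):
            if c not in found:
                found.add(c)
                queue.append(c)
    return found

# Seeding with Music pulls in JazzMusic and, transitively, FusionMusic.
print(sorted(descendants({"Music"}, subclass_of)))
```

With the real KBpedia graph, this expansion is what grows the 17 seed concepts into the full domain scope.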

Next we will try to use this information to create two totally different kinds of training corpuses:

  1. One that will rely on the links between the reference concepts and Wikipedia pages
  2. One that will rely on the linkages to external vocabularies to create a list of named entities that will be used as the training corpus.

Creating the Model With Reference Concepts

The first training corpus we want to test is the one that uses the linkage between KBpedia reference concepts and Wikipedia pages. The first thing is to generate the domain training corpus with the 17 seed reference concepts and then to infer other related reference concepts.

(use 'cognonto-esa.core)
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])


(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology "resources/kbpedia_reference_concepts_linkage.n3"
                                :manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))

(define-domain-corpus ["http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Song-CW"
                       "http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Single"
                       "http://kbpedia.org/kko/rc/RecordCompany"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MovieSoundtrack"
                       "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                       "http://kbpedia.org/kko/rc/Band-MusicGroup"
                       "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Ensemble"
                       "http://kbpedia.org/kko/rc/Orchestra"
                       "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Choir"
                       "http://kbpedia.org/kko/rc/Symphony"
                       "http://kbpedia.org/kko/rc/Singing"
                       "http://kbpedia.org/kko/rc/Concerto"]
  kbpedia
  "resources/aspects-concept-corpus-dictionary.csv"
  :reasoner kbpedia-reasoner)

(create-pruned-pages-dictionary-csv "resources/aspects-concept-corpus-dictionary.csv"
                                    "resources/aspects-concept-corpus-dictionary.pruned.csv" 
                                    "resources/aspects-corpus-normalized/")

Once pruned, we end up with a domain of 108 reference concepts, which will enable us to create models with 108 features. The next step is to create the actual semantic interpreter and the SVM models:

;; Load dictionaries
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/aspects-concept-corpus-dictionary.pruned.csv")

;; Create the semantic interpreter
(build-semantic-interpreter "aspects-concept-pruned" "resources/semantic-interpreters/aspects-concept-pruned/" (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the SVM model vectors
(build-svm-model-vectors "resources/svm/aspects-concept-pruned/" :corpus-folder-normalized "resources/aspects-corpus-normalized/")

;; Train the linear SVM classifier
(train-svm-model "svm.aspects.concept.pruned" "resources/svm/aspects-concept-pruned/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Then we have to evaluate this new model using the gold standard:

(evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
True positive:  28
False positive:  0
True negative:  923
False negative:  66

Precision:  1.0
Recall:  0.29787233
Accuracy:  0.93510324
F1:  0.45901638
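These scores follow directly from the confusion matrix above. A quick sanity check of the reported numbers in plain Python, using the standard metric definitions:

```python
# Confusion-matrix counts reported by evaluate-model above.
tp, fp, tn, fn = 28, 0, 923, 66

precision = tp / (tp + fp)                   # 28 / 28  = 1.0
recall    = tp / (tp + fn)                   # 28 / 94  ~ 0.2979
accuracy  = (tp + tn) / (tp + fp + tn + fn)  # 951 / 1017 ~ 0.9351
f1        = 2 * precision * recall / (precision + recall)  # ~ 0.4590

print(precision, recall, accuracy, f1)
```

With zero false positives, precision is perfect, but the 66 false negatives drag recall, and therefore F1, down.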

Now let's try to find better hyperparameters using grid search:

(svm-grid-search "grid-search-aspects-concept-pruned-tests" 
                       "resources/svm/aspects-concept-pruned/" 
                       "resources/gold-standard-full.csv"
                       :selection-metric :f1
                       :grid-parameters [{:c [1 2 4 16 256]
                                          :e [0.001 0.01 0.1]
                                          :algorithm [:l2l2]
                                          :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
 :selection-metric :f1
 :score 0.84444445 
 :c 1
 :e 0.001 
 :algorithm :l2l2
 :weight 30}
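Under the hood, a grid search of this kind just enumerates the Cartesian product of the candidate parameter values, trains and evaluates a model for each combination, and keeps the best by the selection metric. A minimal Python sketch (where `train_and_score` is a hypothetical stand-in for the train/evaluate cycle shown above):

```python
from itertools import product

# The same grid of candidate values passed to svm-grid-search above.
grid = {"c": [1, 2, 4, 16, 256],
        "e": [0.001, 0.01, 0.1],
        "algorithm": ["l2l2"],
        "weight": [1, 15, 30]}

def grid_search(grid, train_and_score):
    """Evaluate every parameter combination; return (best_score, best_params)."""
    keys = sorted(grid)
    best = None
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_score(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

# This grid visits 5 * 3 * 1 * 3 = 45 combinations.
n_combos = len(list(product(*grid.values())))
print(n_combos)
```

Forty-five train/evaluate cycles is cheap enough to run unattended, which is what makes this kind of broad first pass practical.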

After running the grid search with these initial broad-range values, we found a configuration that gives us 0.8444 for the F1 score. So far, this score is the best we have gotten to date for the full gold standard 2, 3. Let's see all of the metrics for this configuration:

(train-svm-model "svm.aspects.concept.pruned" "resources/svm/aspects-concept-pruned/"
                 :weights {1 30.0}
                 :v nil
                 :c 1 
                 :e 0.001
                 :algorithm :l2l2)

(evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
True positive:  76
False positive:  10
True negative:  913
False negative:  18

Precision:  0.88372093
Recall:  0.80851066
Accuracy:  0.972468
F1:  0.84444445

These results are also the best balance between precision and recall that we have gotten so far 2, 3. Better precision can be obtained if necessary, but only at the expense of lower recall.
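This precision/recall tradeoff is easy to see even outside of SVMs. The following Python sketch uses made-up classifier scores and a simple decision threshold (the use case itself moves along this curve via the class `weight` parameter instead of a threshold):

```python
def precision_recall(scores, labels, threshold):
    """Compute precision/recall when predicting positive above a score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up classifier scores with their true labels (1 = positive).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
labels = [1,    1,   0,   1,   1,   0,   0,   1]

lenient = precision_recall(scores, labels, 0.4)   # low bar: higher recall
strict  = precision_recall(scores, labels, 0.85)  # high bar: higher precision
print(lenient, strict)
```

Raising the bar from 0.4 to 0.85 lifts precision to 1.0 while cutting recall in half, mirroring the tradeoff described above.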

Let's take a look at the improvements we got compared to the previous training corpuses we had:

  • Precision: +4.16%
  • Recall: +35.72%
  • Accuracy: +2.06%
  • F1: +20.63%

This new training corpus based on the KBpedia aspects, after hyperparameter optimization, increased all of the metrics we calculate. The most striking improvement is recall, which improved by more than 35%.

Creating the Model With Entities

The next training corpus we want to test is one that uses the linkage between KBpedia reference concepts and linked external vocabularies to get a series of linked named entities as the positive training set for each of the features of the model.

The first thing to do is to create the positive training set populated with named entities related to the reference concepts. We will get a random sample of ~50 named entities per reference concept:

(require '[cognonto-rdf.query :as query])
(require '[clojure.java.io :as io])
(require '[clojure.data.csv :as csv])
(require '[clojure.string :as string])

(defn generate-domain-by-rc
  [rc domain-file nb]
  (with-open [out-file (io/writer domain-file :append true)]
    (doall
     (->> (query/select
           (str "prefix kko: <http://kbpedia.org/ontologies/kko#>
                 prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                 prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                 prefix dcterms: <http://purl.org/dc/terms/>

                 select distinct ?entity
                 from <http://dbpedia.org>
                 from <http://www.uspto.gov>
                 from <http://wikidata.org>
                 from <http://kbpedia.org/1.10/>
                 where
                 {
                   ?entity dcterms:subject ?category .
                   graph <http://kbpedia.org/1.10/>
                   {
                     ?category ?aspectProperty <" rc "> .
                   }
                 }
                 ORDER BY RAND() LIMIT " nb) kb-connection)
          (map (fn [entity]
                 (csv/write-csv out-file [[(string/replace (:value (:entity entity)) "http://dbpedia.org/resource/" "")
                                           (string/replace rc "http://kbpedia.org/kko/rc/" "")]])))))))


(defn generate-domain-by-rcs 
  [rcs domain-file nb-per-rc]
  (with-open [out-file (io/writer domain-file)]
    (csv/write-csv out-file [["wikipedia-page" "kbpedia-rc"]])
    (doseq [rc rcs] (generate-domain-by-rc rc domain-file nb-per-rc))))

(generate-domain-by-rcs ["http://kbpedia.org/kko/rc/"
                         "http://kbpedia.org/kko/rc/Concerto"
                         "http://kbpedia.org/kko/rc/DoubleAlbum-CW"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Psychedelic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Religious"
                         "http://kbpedia.org/kko/rc/PunkMusic"
                         "http://kbpedia.org/kko/rc/BluesMusic"
                         "http://kbpedia.org/kko/rc/HeavyMetalMusic"
                         "http://kbpedia.org/kko/rc/PostPunkMusic"
                         "http://kbpedia.org/kko/rc/CountryRockMusic"
                         "http://kbpedia.org/kko/rc/BarbershopQuartet-MusicGroup"
                         "http://kbpedia.org/kko/rc/FolkMusic"
                         "http://kbpedia.org/kko/rc/Verse"
                         "http://kbpedia.org/kko/rc/RockBand"
                         "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                         "http://kbpedia.org/kko/rc/Refrain"
                         "http://kbpedia.org/kko/rc/MusicalComposition-GangstaRap"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Klezmer"
                         "http://kbpedia.org/kko/rc/HouseMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-AlternativeCountry"
                         "http://kbpedia.org/kko/rc/PsychedelicMusic"
                         "http://kbpedia.org/kko/rc/ReggaeMusic"
                         "http://kbpedia.org/kko/rc/AlternativeRockBand"
                         "http://kbpedia.org/kko/rc/AlternativeRockMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Trance"
                         "http://kbpedia.org/kko/rc/Ensemble"
                         "http://kbpedia.org/kko/rc/RhythmAndBluesMusic"
                         "http://kbpedia.org/kko/rc/NewAgeMusic"
                         "http://kbpedia.org/kko/rc/RockabillyMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Blues"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Opera"
                         "http://kbpedia.org/kko/rc/Choir"
                         "http://kbpedia.org/kko/rc/SurfMusic"
                         "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/MusicalComposition-JazzRock"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Country"
                         "http://kbpedia.org/kko/rc/CountryMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-PopRock"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Romantic"
                         "http://kbpedia.org/kko/rc/Recitative"
                         "http://kbpedia.org/kko/rc/Chorus"
                         "http://kbpedia.org/kko/rc/FusionMusic"
                         "http://kbpedia.org/kko/rc/MovieSoundtrack"
                         "http://kbpedia.org/kko/rc/GreatestHitsAlbum-CW"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Christian"
                         "http://kbpedia.org/kko/rc/ClassicalMusic-Baroque"
                         "http://kbpedia.org/kko/rc/MusicalComposition-NewAge"
                         "http://kbpedia.org/kko/rc/MusicalComposition-TraditionalPop"
                         "http://kbpedia.org/kko/rc/TranceMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Celtic"
                         "http://kbpedia.org/kko/rc/LoungeMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Reggae"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Baroque"
                         "http://kbpedia.org/kko/rc/Trio-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/Symphony"
                         "http://kbpedia.org/kko/rc/MusicalComposition-RockAndRoll"
                         "http://kbpedia.org/kko/rc/PopRockMusic"
                         "http://kbpedia.org/kko/rc/IndustrialMusic"
                         "http://kbpedia.org/kko/rc/JazzMusic"
                         "http://kbpedia.org/kko/rc/MusicalChord"
                         "http://kbpedia.org/kko/rc/ProgressiveRockMusic"
                         "http://kbpedia.org/kko/rc/GothicMusic"
                         "http://kbpedia.org/kko/rc/LiveAlbum-CW"
                         "http://kbpedia.org/kko/rc/NewWaveMusic"
                         "http://kbpedia.org/kko/rc/NationalAnthem"
                         "http://kbpedia.org/kko/rc/OldieSong"
                         "http://kbpedia.org/kko/rc/Song-Sung"
                         "http://kbpedia.org/kko/rc/RockMusic"
                         "http://kbpedia.org/kko/rc/Aria"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Disco"
                         "http://kbpedia.org/kko/rc/GospelMusic"
                         "http://kbpedia.org/kko/rc/BluegrassMusic"
                         "http://kbpedia.org/kko/rc/FolkRockMusic"
                         "http://kbpedia.org/kko/rc/RockAndRollMusic"
                         "http://kbpedia.org/kko/rc/Opera-CW"
                         "http://kbpedia.org/kko/rc/HitSong-CW"
                         "http://kbpedia.org/kko/rc/Tune"
                         "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup"
                         "http://kbpedia.org/kko/rc/RapMusic"
                         "http://kbpedia.org/kko/rc/RecordCompany"
                         "http://kbpedia.org/kko/rc/MusicalComposition-ACappella"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Electronica"
                         "http://kbpedia.org/kko/rc/Music"
                         "http://kbpedia.org/kko/rc/GlamRockMusic"
                         "http://kbpedia.org/kko/rc/LoveSong"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Gothic"
                         "http://kbpedia.org/kko/rc/MarchingBand"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Punk"
                         "http://kbpedia.org/kko/rc/BluesRockMusic"
                         "http://kbpedia.org/kko/rc/TechnoMusic"
                         "http://kbpedia.org/kko/rc/SoulMusic"
                         "http://kbpedia.org/kko/rc/ChamberMusicComposition"
                         "http://kbpedia.org/kko/rc/Requiem"
                         "http://kbpedia.org/kko/rc/MusicalComposition"
                         "http://kbpedia.org/kko/rc/ElectronicMusic"
                         "http://kbpedia.org/kko/rc/CompositionMovement"
                         "http://kbpedia.org/kko/rc/StringQuartet-MusicGroup"
                         "http://kbpedia.org/kko/rc/Riff"
                         "http://kbpedia.org/kko/rc/Anthem"
                         "http://kbpedia.org/kko/rc/HardRockMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-BluesRock"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Cyberpunk"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Industrial"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Funk"
                         "http://kbpedia.org/kko/rc/Album-CW"
                         "http://kbpedia.org/kko/rc/HipHopMusic"
                         "http://kbpedia.org/kko/rc/Single"
                         "http://kbpedia.org/kko/rc/Singing"
                         "http://kbpedia.org/kko/rc/SwingMusic"
                         "http://kbpedia.org/kko/rc/Song-CW"
                         "http://kbpedia.org/kko/rc/SalsaMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Jazz"
                         "http://kbpedia.org/kko/rc/ClassicalMusic"
                         "http://kbpedia.org/kko/rc/MilitaryBand"
                         "http://kbpedia.org/kko/rc/SkaMusic"
                         "http://kbpedia.org/kko/rc/Orchestra"
                         "http://kbpedia.org/kko/rc/GrungeRockMusic"
                         "http://kbpedia.org/kko/rc/SouthernRockMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Ambient"
                         "http://kbpedia.org/kko/rc/DiscoMusic"] "resources/aspects-domain-corpus.csv")

Next, let's create the actual positive training corpus and normalize it:

(cache-aspects-corpus "resources/aspects-domain-corpus.csv" "resources/aspects-corpus/")
(normalize-cached-corpus "resources/aspects-corpus/" "resources/aspects-corpus-normalized/")

We end up with 22 features for which we can get named entities from the KBpedia Knowledge Base; these become the 22 features of our model. The complete positive training set has 799 documents in it.

(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/aspects-entities-corpus-dictionary.pruned.csv")

(build-semantic-interpreter "aspects-entities-pruned" "resources/semantic-interpreters/aspects-entities-pruned/" (distinct (concat (get-domain-pages) (get-general-pages))))

(build-svm-model-vectors "resources/svm/aspects-entities-pruned/" :corpus-folder-normalized "resources/aspects-corpus-normalized/")

(train-svm-model "svm.aspects.entities.pruned" "resources/svm/aspects-entities-pruned/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Now let's evaluate the model with default hyperparameters:

(evaluate-model "svm.aspects.entities.pruned" "resources/gold-standard-full.csv")
True positive:  9
False positive:  10
True negative:  913
False negative:  85

Precision:  0.47368422
Recall:  0.095744684
Accuracy:  0.906588
F1:  0.15929204

Now let's try to improve this F1 score using grid search:

(svm-grid-search "grid-search-aspects-entities-pruned-tests" 
                 "resources/svm/aspects-entities-pruned/" 
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1 2 4 16 256]
                                    :e [0.001 0.01 0.1]
                                    :algorithm [:l2l2]
                                    :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
 :selection-metric :f1
 :score 0.44052863
 :c 4
 :e 0.001
 :algorithm :l2l2
 :weight 15}

We have been able to greatly improve the F1 score by tweaking the hyperparameters, but the results are still disappointing. There are multiple ways to automatically generate training corpuses, but not all of them are created equal. This is why a pipeline that can automatically create the training corpuses, optimize the hyperparameters, and evaluate the models is more than welcome, since these steps are the bulk of the time a data scientist spends creating models.
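Such a pipeline amounts to a loop over candidate corpuses: build a model from each, optimize its hyperparameters, evaluate it against the same gold standard, and keep the winner. A minimal Python sketch, with dummy stand-ins for the steps documented above:

```python
def run_pipeline(corpus_builders, optimize, evaluate):
    """Build, optimize, and evaluate one model per corpus; report the best by score."""
    results = {}
    for name, build in corpus_builders.items():
        model = build()                    # e.g. train an SVM from this corpus
        model = optimize(model)            # e.g. grid-search its hyperparameters
        results[name] = evaluate(model)    # e.g. F1 against the gold standard
    best = max(results, key=results.get)
    return best, results

# Dummy stand-ins mirroring the two corpuses compared in this use case,
# with the F1 scores each one achieved above.
builders = {"concepts": lambda: "concept-model", "entities": lambda: "entity-model"}
f1_scores = {"concept-model": 0.84444445, "entity-model": 0.44052863}

best, results = run_pipeline(builders, lambda m: m, f1_scores.get)
print(best)
```

Because every step is automated, adding a new candidate corpus is just one more entry in the map, and the comparison against the same gold standard comes for free.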

Conclusion

After automatically creating multiple different positive and negative training sets, testing multiple learning methods, and optimizing hyperparameters, we found the best training set, learning method and hyperparameters to create an initial, optimal model with an accuracy of 97.2%, a precision of 88.4%, a recall of 80.9% and an overall F1 measure of 84.4% on a gold standard created from real, random pieces of news from different general and specialized news sites.

What is really interesting and innovative in this method is how a knowledge base of concepts and entities can be used to label positive and negative training sets to feed supervised learners, and how the resulting learner can perform well on totally different input text data (in this case, news articles). The same is true when creating training corpuses for unsupervised learning 3.

The benefit from an operational standpoint is that all of this searching, testing and optimizing can be performed by a computer automatically. The only tasks required of a human are to define the scope of a domain and to manually label a gold standard for performance evaluation and hyperparameter optimization.