User:Joal/JanusGraph

From Wikitech

This page is my work log of experiments with JanusGraph.

Links

WDQS

Wikidata

Janus/Gremlin/Tinkerpop

2019-09-06 - Install and tests on Cloud VPS

I already installed JanusGraph on Cloud VPS once, but that was almost a year ago at All-Hands. Starting fresh :)

I'm using Debian 9.9 Stretch with OpenJDK 8 (JanusGraph needs Java 1.8) and JanusGraph 0.4.0 (latest as of 2019-09-06)

Install and test

  • I created the janus1-1 large instance with Horizon in the Cloud VPS analytics project, using Debian 9.9 Stretch (Java 8 needed).
  • I followed the introduction section of https://docs.janusgraph.org/, swapping the Elasticsearch index backend for Lucene (single-node test).

Install

ssh janus1-1.analytics.eqiad.wmflabs

sudo apt-get install unzip openjdk-8-jre

wget https://github.com/JanusGraph/janusgraph/releases/download/v0.4.0/janusgraph-0.4.0-hadoop2.zip

unzip janusgraph-0.4.0-hadoop2.zip

cd janusgraph-0.4.0-hadoop2

./bin/gremlin.sh

Test

/**********************************************
  Configure and load graph
**********************************************/

// Create graph with updated configuration (Lucene instead of ES)
graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje-lucene.properties')

// Load graph example
GraphOfTheGodsFactory.load(graph)

// Create graph traversal object
g = graph.traversal()

/**********************************************
  Test graph traversal
**********************************************/

// Create a pointer to the Saturn node using index on name
saturn = g.V().has('name', 'saturn').next()

// Show the Saturn node pointer values ([name:[saturn], age:[10000]])
g.V(saturn).valueMap()

// Use the Saturn node pointer to find Saturn grand-child name (hercules)
g.V(saturn).in('father').in('father').values('name')
==>hercules

// Use geo index to find edges having a place property within 50km of Athens (2 results)
g.E().has('place', geoWithin(Geoshape.circle(37.97, 23.72, 50)))

// Find nodes connected to the edges found by geo-index query and show their names (2 results)
g.E().has('place', geoWithin(Geoshape.circle(37.97, 23.72, 50))).
  as('source').inV().as('god2').
  select('source').outV().as('god1').
  select('god1', 'god2').by('name')

2019-09-16 - Analyze and prepare Wikidata-truthy for loading

(started in 2019-09-06 session)

Load dump

import org.apache.spark.sql.functions._

val dump_path = "/user/joal/wmf/data/raw/mediawiki/wikidata/truthy_ntdumps/20190904"
val df = spark.read.format("csv").
  option("mode", "FAILFAST").
  option("delimiter", " ").
  load(dump_path).
  withColumnRenamed("_c0", "origin").
  withColumnRenamed("_c1", "link").
  withColumnRenamed("_c2", "dest").
  drop("_c3").
  cache()
  
df.count()
// 4139056936 - Wow!!!

df.where("origin is null or link is null or dest is null").count()
// 0 - \o/ well-formed data

df.select("origin").distinct().count()
// 124151595

df.select("dest").distinct().count()
// 685067856

df.select("link").distinct().count()
// 6516
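
The CSV read above treats each dump line as space-separated fields and drops the trailing "." column. As a small sanity check of that idea without Spark, here is a minimal plain-Scala sketch. The `NTriples` object, the `Triple` case class and the sample line are hypothetical and only for illustration; it assumes single-space separators, as the dump appears to use.

```scala
object NTriples {
  // Hypothetical holder mirroring the origin / link / dest column names used above.
  final case class Triple(origin: String, link: String, dest: String)

  // Split one N-Triples line into (origin, link, dest).
  // Subject and predicate never contain spaces, so only the first two
  // spaces act as separators; the object (dest) may itself contain spaces.
  def parse(line: String): Option[Triple] = {
    val trimmed = line.trim
    if (!trimmed.endsWith(".")) None
    else {
      // Drop the terminating "." (the _c3 column in the Spark version).
      val body = trimmed.dropRight(1).trim
      body.split(" ", 3) match {
        case Array(s, p, o) => Some(Triple(s, p, o))
        case _              => None
      }
    }
  }
}
```

For example, `NTriples.parse("<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> .")` yields a `Triple` whose `dest` is `<http://www.wikidata.org/entity/Q5>`.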

Analyze and filter links

// Check http://www.wikidata.org/prop links
df.where("link like '<http://www.wikidata.org/prop%'").select("link").distinct.count
// 6486                                                              
df.where("link like '<http://www.wikidata.org/prop/direct/%'").select("link").distinct.count
// 6351                                                              
df.where("link like '<http://www.wikidata.org/prop/direct-normalized/%'").select("link").distinct.count
// 135 - 6351 + 135 = 6486, so direct or direct-normalized only - GOOD :)

// Check other link types and evaluate whether to keep them or not
df.where("link not like '<http://www.wikidata.org/prop%'").groupBy("link").count.sort(desc("count")).show(100, false)
/*
+----------------------------------------------------+----------+               
|link                                                |count     |
+----------------------------------------------------+----------+

** To Keep (in addition to direct and direct-normalized links):
|<http://www.w3.org/2002/07/owl#sameAs>              |2464024   |


** To remove:
  ** We drop all language-related links
|<http://schema.org/name>                            |322876582 |
|<http://schema.org/description>                     |2014877520|
|<http://www.w3.org/2004/02/skos/core#prefLabel>     |322876582 |
|<http://www.w3.org/2000/01/rdf-schema#label>        |322876582 |
|<http://www.w3.org/2004/02/skos/core#altLabel>      |67929447  |

  ** We drop metadata
|<http://schema.org/dateModified>                    |62033634  |
|<http://schema.org/version>                         |62033306  |
|<http://schema.org/about>                           |62033306  |

  ** We drop redundant info (this is described as link-property)
|<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>   |121788917 |

// Origin is PXXX and dest is a derivative of PXXX without other usage (origin or dest)
|<http://wikiba.se/ontology#qualifier>               |6595      |
|<http://www.w3.org/2002/07/owl#someValuesFrom>      |6595      |
|<http://wikiba.se/ontology#claim>                   |6595      |
|<http://wikiba.se/ontology#statementProperty>       |6595      |
|<http://www.w3.org/2002/07/owl#onProperty>          |6595      |
|<http://wikiba.se/ontology#referenceValue>          |6595      |
|<http://wikiba.se/ontology#reference>               |6595      |
|<http://wikiba.se/ontology#directClaim>             |6595      |
|<http://wikiba.se/ontology#statementValue>          |6595      |
|<http://wikiba.se/ontology#qualifierValue>          |6595      |
|<http://wikiba.se/ontology#directClaimNormalized>   |4758      |
|<http://wikiba.se/ontology#referenceValueNormalized>|4758      |
|<http://wikiba.se/ontology#statementValueNormalized>|4758      |
|<http://wikiba.se/ontology#qualifierValueNormalized>|4758      |

  ** Used for dumps info only (a lot of same rows ... weird)
|<http://www.w3.org/2002/07/owl#imports>             |328       |
|<http://schema.org/softwareVersion>                 |328       |
|<http://creativecommons.org/ns#license>             |328       |

// Interesting for value interpretation (kept in own dataset)
|<http://wikiba.se/ontology#propertyType>            |6595      |

// Values from link is also used as origin -- Seems not used in truthy
|<http://wikiba.se/ontology#novalue>                 |6595      |
// Used with previous  -^
|<http://www.w3.org/2002/07/owl#complementOf>        |6595      |
+----------------------------------------------------+----------+

Checking code samples (to be updated for each link type and format):
df.where("link = '<http://www.w3.org/2002/07/owl#someValuesFrom>'").show(20, false)
df.where("link = '<http://www.w3.org/2002/07/owl#someValuesFrom>'").selectExpr("split(origin, '/')[4] as o", "split(dest, '/')[4] as d").where("o <> d").count
df.where("""
  origin like '<http://www.wikidata.org/prop/P%'
  AND link != '<http://www.w3.org/2002/07/owl#someValuesFrom>'""").show(20, false)
*/

val fdf = df.where("""
      -- Dropping descriptions, labels, versions...
      link NOT IN (
        '<http://schema.org/name>',
        '<http://schema.org/description>',
        '<http://www.w3.org/2004/02/skos/core#prefLabel>',
        '<http://www.w3.org/2000/01/rdf-schema#label>',
        '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
        '<http://www.w3.org/2004/02/skos/core#altLabel>',
        '<http://schema.org/dateModified>',
        '<http://schema.org/about>',
        '<http://schema.org/version>',
        '<http://wikiba.se/ontology#claim>',
        '<http://wikiba.se/ontology#statementProperty>',
        '<http://wikiba.se/ontology#qualifier>',
        '<http://wikiba.se/ontology#directClaim>',
        '<http://wikiba.se/ontology#statementValue>',
        '<http://wikiba.se/ontology#qualifierValue>',
        '<http://wikiba.se/ontology#reference>',
        '<http://www.w3.org/2002/07/owl#onProperty>',
        '<http://wikiba.se/ontology#referenceValue>',
        '<http://wikiba.se/ontology#statementValueNormalized>',
        '<http://wikiba.se/ontology#referenceValueNormalized>',
        '<http://wikiba.se/ontology#directClaimNormalized>',
        '<http://wikiba.se/ontology#qualifierValueNormalized>',
        '<http://www.w3.org/2002/07/owl#someValuesFrom>',
        '<http://wikiba.se/ontology#novalue>',
        '<http://www.w3.org/2002/07/owl#complementOf>',
        '<http://wikiba.se/ontology#propertyType>'
      ) AND  origin != '<http://wikiba.se/ontology#Dump>'
  """).cache()
  
fdf.count()
// 825663549 -- Better - Need some naming effort

// checking and defining property types
df.where("link = '<http://wikiba.se/ontology#propertyType>' and origin not like '<http://www.wikidata.org/entity/P%'").count
// 0

val propertyTypes = df.
  where("link = '<http://wikiba.se/ontology#propertyType>'").
  selectExpr("replace(split(origin, '/')[4], '>', '') AS property", "replace(split(dest, '#')[1], '>', '') as propertyType").
  cache()
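
The two `selectExpr` expressions above rely on positional splitting of the IRIs. To make the indexing explicit, here is a minimal plain-Scala sketch of the same string logic. The `PropertyTypes` object and its function names are hypothetical, and it assumes IRIs shaped like the ones checked above.

```scala
object PropertyTypes {
  // Mirrors replace(split(origin, '/')[4], '>', '').
  // "<http://www.wikidata.org/entity/P31>" splits on "/" into
  // "<http:", "", "www.wikidata.org", "entity", "P31>", so index 4 holds the id.
  def entityId(iri: String): String =
    iri.split("/")(4).replace(">", "")

  // Mirrors replace(split(dest, '#')[1], '>', '').
  // Keeps the fragment after '#', e.g. "WikibaseItem".
  def typeName(iri: String): String =
    iri.split("#")(1).replace(">", "")
}
```

Both helpers throw on input that does not have the expected shape, which is why the count checks above (0 rows not matching the entity prefix) matter before applying them.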

Rename values for simplicity and size, and pivot some data (names and property-types)

// Check origin values
fdf.where("origin not like '<http://www.wikidata.org/entity/%'").count
// 0 - We have the scheme :)

val fdfr1 = fdf.selectExpr(
  "replace(split(origin, '/')[4], '>', '') AS origin",
  """CASE
        -- Dropping difference between direct and direct-normalized (only used for ExternalId)
        WHEN link like '<http://www.wikidata.org/prop/%' THEN replace(split(link, '/')[5], '>', '')
        WHEN link = '<http://www.w3.org/2002/07/owl#sameAs>' THEN 'SameAs'
        ELSE link
  END as link""",
  """CASE
        -- Note: schema.org/name is filtered out of fdf above, so this branch currently never matches
        WHEN link = '<http://schema.org/name>' THEN replace(dest, '@en', '')
        ELSE dest
  END as dest"""
).cache()
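
As an illustration of the link-renaming CASE expression above, here is a minimal plain-Scala sketch of the same rule, outside Spark. The `LinkNames` object and `shortLink` function are hypothetical names introduced only for this example.

```scala
object LinkNames {
  private val PropPrefix = "<http://www.wikidata.org/prop/"
  private val SameAs     = "<http://www.w3.org/2002/07/owl#sameAs>"

  // direct and direct-normalized links both collapse to the bare property id,
  // because the id sits at index 5 after splitting on "/" in either case.
  def shortLink(link: String): String =
    if (link.startsWith(PropPrefix)) link.split("/")(5).replace(">", "")
    else if (link == SameAs) "SameAs"
    else link
}
```

With this rule, `<http://www.wikidata.org/prop/direct/P31>` becomes `P31`, and a direct-normalized link for the same property would map to the same short name.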


val fdfj1 = fdfr1.join(propertyTypes, col("link") === col("property"), "left").drop("property").cache
fdfj1.groupBy("propertyType").count().sort(desc("count")).show(100, false)
/*
+----------------+---------+                                                    
|propertyType    |count    |
+----------------+---------+
|WikibaseItem    |387850621|
|String          |169523450|
|ExternalId      |135193147|
|Time            |32810131 |
|Monolingualtext |27807321 |
|Quantity        |9643796  |
|GlobeCoordinate |7603827  |
|CommonsMedia    |3651002  |
|Url             |3045010  |
|null            |2464024  |
|WikibaseProperty|24410    |
|Math            |4105     |
|GeoShape        |2844     |
|WikibaseLexeme  |1299     |
|MusicalNotation |291      |
|TabularData     |16       |
|WikibaseSense   |13       |
|WikibaseForm    |2        |
+----------------+---------+

*/

// Looking for dest renaming scheme

fdfj1.where("propertyType is null").select("link").distinct.show(20, false)
/*
+------+
|link  |
+------+
|SameAs|
+------+
*/

fdfj1.where("propertyType = 'ExternalId'").show(20, false)

fdfj1.where("link = 'SameAs' and dest not like '<http://www.wikidata.org/entity/Q%'").count
// 0
fdfj1.where("""propertyType = 'WikibaseProperty'
                 AND dest not like '<http://www.wikidata.org/entity/P%'
                 AND dest not like '_:genid%'""").count
// 0
fdfj1.where("""propertyType = 'WikibaseLexeme'
                 AND dest not like '<http://www.wikidata.org/entity/L%'
                 """).count
// 0
fdfj1.where("""propertyType = 'WikibaseSense'
                 AND dest not like '<http://www.wikidata.org/entity/L%'
                 """).count
// 0
fdfj1.where("""propertyType = 'WikibaseForm'
                 AND dest not like '<http://www.wikidata.org/entity/L%'
                 """).count

fdfj1.where("dest like '%^^<%'").groupBy("propertyType").count().sort(desc("count")).show(100, false)
/*
+---------------+--------+
|propertyType   |count   |
+---------------+--------+
|Time           |32779728|
|Quantity       |9643081 |
|GlobeCoordinate|7602994 |
|Math           |4105    |
+---------------+--------+
*/
fdfj1.where("dest like '%^^<%'").selectExpr("split(replace(dest, '^^', ';;'), ';;')[1] as typ", "propertyType").groupBy("typ", "propertyType").count().sort(desc("count")).show(100, false)
/*
+-------------------------------------------------+---------------+--------+    
|typ                                              |propertyType   |count   |
+-------------------------------------------------+---------------+--------+
|<http://www.w3.org/2001/XMLSchema#dateTime>      |Time           |32779728|
|<http://www.w3.org/2001/XMLSchema#decimal>       |Quantity       |9643081 |
|<http://www.opengis.net/ont/geosparql#wktLiteral>|GlobeCoordinate|7602994 |
|<http://www.w3.org/1998/Math/MathML>             |Math           |4105    |
+-------------------------------------------------+---------------+--------+
We can get rid of the inner-value type :)
*/

fdfj1.where("""propertyType = 'WikibaseItem'
                 AND dest not like '<http://www.wikidata.org/entity/Q%'
                 AND dest not like '_:genid%'""").count
// 0 - \o/ only origin values :)




// Renaming values in 2 stages to remove double quotes
val fdfr2 = fdfj1.selectExpr(
  "origin",
  "link",
  """CASE
        WHEN link = 'SameAs' THEN replace(split(dest, '/')[4], '>', '')
        WHEN propertyType = 'WikibaseItem' AND dest like '<http://www.wikidata.org/entity/Q%'
          THEN replace(split(dest, '/')[4], '>', '')
        WHEN propertyType = 'WikibaseProperty' AND dest like '<http://www.wikidata.org/entity/P%'
          THEN replace(split(dest, '/')[4], '>', '')
        WHEN propertyType IN ('WikibaseLexeme', 'WikibaseSense', 'WikibaseForm')
          THEN replace(split(dest, '/')[4], '>', '')
        WHEN propertyType IN ('Time', 'Quantity', 'GlobeCoordinate', 'Math') and dest like '%^^<%' THEN split(replace(dest, '^^', ';;'), ';;')[0]
        ELSE dest
  END as dest""").cache()

val fdfr3 = fdfr2.selectExpr(
  "origin",
  "link",
  """CASE
        WHEN dest rlike '^"[^"]*"$' THEN replace(dest, '"', '')
        ELSE dest
  END as dest""").cache()

fdfr3.repartition(8).write.mode("overwrite").option("compression", "gzip").json("/user/joal/test_wdqs/truthy_20190916")
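
To illustrate the two-stage cleanup of typed literals used for fdfr2 and fdfr3, here is a minimal plain-Scala sketch of the string logic only. The `DestCleanup` object and its function names are hypothetical, and unlike the Spark version it does not look at propertyType; it only assumes values shaped like the typed literals counted above.

```scala
object DestCleanup {
  // Stage 1: for a typed literal such as "..."^^<datatype-iri>,
  // keep only the part before the first "^^".
  def stripDatatype(dest: String): String =
    if (dest.contains("^^<")) dest.substring(0, dest.indexOf("^^")) else dest

  // Stage 2: remove the double quotes only when the whole value is a single
  // quoted string without inner quotes (same pattern as the rlike above).
  def stripQuotes(dest: String): String =
    if (dest.matches("\"[^\"]*\"")) dest.replace("\"", "") else dest

  def clean(dest: String): String = stripQuotes(stripDatatype(dest))
}
```

Doing it in two stages keeps each rule simple: once the datatype suffix is gone, a typed literal looks like any other quoted string, so a single quote-stripping rule covers both cases. Values carrying a language tag (for example `"..."@en`) are left untouched by this sketch.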