User:Reversedragon/Embedding RDF in wiki pages (SeaTurtle proposal)

During the early development of this project, we attempted to use Wikibase, but quickly realized that it has some serious problems when starting from a fresh MediaWiki install. One of the biggest was that transcluding Wikibase Items into regular wiki pages through templates is not trivial: it requires the ParserFunctions extension, which we were unable to install. Another was that it is not easy to list Wikibase Items using normal MediaWiki Categories, or to customize category listings so that they display basic Item metadata. This seems like an obvious need for new users who are exploring a knowledge base but do not know much about MediaWiki or how Wikibase works: nested categories are a great way to get a feel for the overall topics the knowledge base covers, and MediaWiki's built-in Category mechanism is easy for new users to learn to edit.

This slowly led us toward the development of a new MediaWiki extension, tentatively named SeaTurtle. The extension is currently only in the planning phase; this page collects design notes to help with coding it.

Wikibase representation

Wikibase already has an established RDF representation of its data model for exporting triples and creating large data dumps, as well as an official OWL ontology for some prefixes and data types. (Warning: the OWL file may start a download.) This is conceivably useful for the purpose of manipulating Items inside text pages.

Although Wikibase does have a JSON serialization format, this can quickly get unwieldy for purposes such as Entity labels. Ideally, if we are to store Entities in pages as text, we should make sure the representation of Entities more or less follows a similar design philosophy to wikitext, such that edits to an Item make sense in a page diff view, and so forth.

Storing RDF inside text pages

RDF Turtle can be embedded into an HTML page using the HTML script tag. The Turtle format is relatively easy to work with because it mostly consists of simple statements of three consecutive terms, <Subject> <Predicate> <Object>, each ended by a period; for example, <Q1> a <Item> . Long URIs can be abbreviated by declaring a prefix with the @prefix directive.

<script type="text/turtle" id="P15"><![CDATA[
@base         <https://research.moraleconomy.au/entity/> .   # called wd: in Wikidata's dumps
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix wikibase: <http://wikiba.se/ontology#> .
@prefix wdt:  <https://research.moraleconomy.au/prop/direct/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .

<P15>
  # "a" is pre-defined to stand for rdf:type, but the purpose of it is arguably hard to remember
  rdf:type wikibase:Property ;   # the entity is a Property
  wikibase:propertyType wikibase:WikibaseItem ;   # the Property takes an Item as its value

  # claims with simple values - "truthy" statements
  wdt:P30 <P14> ;   # inverse property of - appears in work  (en)

  schema:description "a Property"@en ;   # Item or Property description
  skos:altLabel      "depicts"@en , "illustrates"@en , "tropes"@en , "motifs"@en;   # each alternate label
  skos:prefLabel     "work depicts or contains"@en .   # primary label
  # if the least-edited thing is last, it's harder to forget the last period.
]]></script>
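To make the role of @base and @prefix concrete, here is a minimal Python sketch of how a term from the block above could be expanded into a full IRI. It is an illustration only, not SeaTurtle code: the function names read_directives and expand are hypothetical, and the regular expressions cover only the one-directive-per-line layout used on this page rather than the full Turtle grammar.

```python
import re
from urllib.parse import urljoin

# Match "@prefix name: <iri> ." and "@base <iri> ." written on a single line.
PREFIX_RE = re.compile(r'^\s*@prefix\s+([A-Za-z][\w-]*)?:\s*<([^>]*)>\s*\.')
BASE_RE = re.compile(r'^\s*@base\s+<([^>]*)>\s*\.')


def read_directives(turtle_text):
    """Collect the @base IRI and the @prefix table from a Turtle block."""
    base, prefixes = "", {}
    for line in turtle_text.splitlines():
        m = BASE_RE.match(line)
        if m:
            base = m.group(1)
            continue
        m = PREFIX_RE.match(line)
        if m:
            prefixes[m.group(1) or ""] = m.group(2)
    return base, prefixes


def expand(term, base, prefixes):
    """Turn <relative>, prefix:local, or the keyword a into a full IRI."""
    if term == "a":
        return "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    if term.startswith("<") and term.endswith(">"):
        return urljoin(base, term[1:-1])  # resolve relative IRIs against @base
    prefix, sep, local = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    raise ValueError(f"cannot expand term: {term!r}")


SAMPLE = "\n".join([
    "@base <https://research.moraleconomy.au/entity/> .   # entity namespace",
    "@prefix wikibase: <http://wikiba.se/ontology#> .",
    "@prefix wdt:  <https://research.moraleconomy.au/prop/direct/> .",
])

base, prefixes = read_directives(SAMPLE)
print(expand("<P15>", base, prefixes))
print(expand("wdt:P30", base, prefixes))
print(expand("wikibase:Property", base, prefixes))
```

Seen this way, <P15> and wdt:P30 are just shorthand for ordinary URLs under the wiki's own domain, which is why the prefix lines at the top of a block matter.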

For a prettier display of RDF statements and the potential to use the multilingual editor to add or change lines just as with Wikibase, we want MediaWiki to find this Turtle block and interpret all its lines into a series of claims internally. This should not be difficult — Turtle is relatively easy to parse. Once we know MediaWiki can parse the contents, a simpler syntax for marking Turtle blocks may be in order:

```ttl
@base  <https://research.moraleconomy.au/entity/> .
# ... prefixes ...

<P15>  a  wikibase:Property .
# ... characteristics or claims ...
<P15>  skos:altLabel  "depicts"@en .
<P15>  skos:altLabel  "illustrates"@en .
<P15>  skos:prefLabel  "work depicts or contains"@en .
# we could make every statement a "complete sentence" like this; the semicolon and comma are just abbreviations.
```
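To give a sense of how little machinery the "complete sentence" form needs, the following Python sketch reads one statement per line and returns plain (subject, predicate, object) tuples. It is a sketch under stated assumptions rather than a real Turtle parser: it assumes every statement sits on its own line and ends with a period separated by whitespace, it skips directives and comments, and it does not handle the semicolon or comma abbreviations. The names TOKEN_RE and parse_statements are hypothetical.

```python
import re

TOKEN_RE = re.compile(r'''
    <[^>]*>                               # IRI reference such as <P15>
  | "(?:[^"\\]|\\.)*"(?:@[A-Za-z-]+)?     # quoted literal, optional language tag
  | \#.*                                  # comment running to the end of the line
  | \S+                                   # prefixed name, the keyword a, or the final period
''', re.VERBOSE)


def parse_statements(turtle_text):
    """Parse one 'subject predicate object .' statement per line into tuples."""
    triples = []
    for number, line in enumerate(turtle_text.splitlines(), start=1):
        tokens = [t for t in TOKEN_RE.findall(line) if not t.startswith("#")]
        if not tokens or tokens[0].startswith("@"):
            continue  # blank line, comment-only line, or @base/@prefix directive
        if len(tokens) != 4 or tokens[3] != ".":
            raise ValueError(f"line {number}: expected 'subject predicate object .'")
        triples.append(tuple(tokens[:3]))
    return triples


SAMPLE = "\n".join([
    "@base  <https://research.moraleconomy.au/entity/> .",
    "# ... prefixes ...",
    "",
    "<P15>  a  wikibase:Property .",
    '<P15>  skos:altLabel  "depicts"@en .',
    '<P15>  skos:prefLabel  "work depicts or contains"@en .  # primary label',
])

for triple in parse_statements(SAMPLE):
    print(triple)
```

Each tuple corresponds to one line of the block, which is also what would keep a page diff readable: changing a label changes exactly one line.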

The only real issue with this simplified syntax, or even the wordier HTML syntax, is that it does not tell MediaWiki whether a block is merely a Turtle example included for illustration or the Turtle block that represents this particular wiki page. For this purpose we can make use of MediaWiki's Magic words feature and add a string that marks a Turtle block as an Entity block. In theory, MediaWiki should scan the whole page for __ENTITY__ and mark the page as a potential Entity if it is found; if the string appears on the opening line of a particular Turtle block or on a line inside it, MediaWiki should parse that block as a special Entity block rather than simply syntax-highlighting it.

```ttl __ENTITY__
@base  <https://research.moraleconomy.au/entity/> .  # __ENTITY__ could also go in a comment, etc
# ... prefixes ...

<P15>  a  wikibase:Property .
# ... characteristics or claims ...
<P15>  skos:prefLabel  "work depicts or contains"@en .
```
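The two-step check described above (scan the page, then look inside each Turtle block) could be sketched in Python roughly as follows. This only illustrates the idea and is not how MediaWiki processes magic words internally: the names FENCE, FENCE_RE and find_entity_block are hypothetical, and the sketch assumes Turtle blocks are written with the fenced ttl syntax shown on this page.

```python
import re

MAGIC_WORD = "__ENTITY__"
FENCE = "`" * 3  # the three-backtick fence, built here to keep this listing readable

# A block opened by a fence followed by "ttl" (plus anything else on that line)
# and closed by a line holding only the fence.
FENCE_RE = re.compile(
    r'^' + FENCE + r'ttl\b([^\n]*)\n(.*?)^' + FENCE + r'[ \t]*$',
    re.MULTILINE | re.DOTALL,
)


def find_entity_block(page_text):
    """Return the body of the first Turtle block marked __ENTITY__, or None."""
    if MAGIC_WORD not in page_text:
        return None  # cheap whole-page check: this page is not an Entity
    for match in FENCE_RE.finditer(page_text):
        info, body = match.group(1), match.group(2)
        if MAGIC_WORD in info or MAGIC_WORD in body:
            return body
    return None  # the magic word appears only outside any Turtle block


PAGE = "\n".join([
    "Some ordinary wikitext with a decorative example:",
    FENCE + "ttl",
    "<Q1>  a  <Item> .",
    FENCE,
    "",
    FENCE + "ttl __ENTITY__",
    "<P15>  a  wikibase:Property .",
    FENCE,
    "",
])

print(find_entity_block(PAGE))
```

In this sketch the decorative block is skipped and only the marked block is handed on for parsing. Whether a page may contain more than one Entity block is left open here; the sketch simply returns the first.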

Lexemes

Querying Turtle blocks

Turtle blocks may seem almost too simple. How can the search function possibly query them? Well, every Entity within Wikibase is itself stored as a JSON document behind the scenes, and Wikibase manages to search through these just fine.

It seems (?) that the claims inside Wikibase Items are cached in a SQL database. If this is the case, searching for any label or Property ID should not be any slower than if Entities were input in JSON. For that matter, a regular text search should be able to find un-localized Item and Property IDs or an Item's own localized labels on any Item.
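If claims parsed from Turtle blocks were cached in SQL in a similar way, lookups by label or Property ID would become ordinary indexed queries. The Python sketch below uses the standard sqlite3 module with an in-memory database to show the idea. The table layout and the function names are assumptions made for illustration; they do not reflect Wikibase's actual schema or any existing SeaTurtle code.

```python
import sqlite3


def build_claim_cache(triples):
    """Store (subject, predicate, object) tuples in a hypothetical claims table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE claims (subject TEXT, predicate TEXT, object TEXT)")
    conn.execute("CREATE INDEX claims_predicate ON claims (predicate)")
    conn.execute("CREATE INDEX claims_object ON claims (object)")
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?)", triples)
    conn.commit()
    return conn


def subjects_with_label(conn, label_text):
    """Find entities whose preferred or alternate label equals label_text in any language."""
    # Note: % and _ inside label_text would act as LIKE wildcards in this sketch.
    rows = conn.execute(
        "SELECT DISTINCT subject FROM claims "
        "WHERE predicate IN ('skos:prefLabel', 'skos:altLabel') AND object LIKE ? "
        "ORDER BY subject",
        ('"' + label_text + '"@%',),
    )
    return [row[0] for row in rows]


def subjects_using_property(conn, property_id):
    """Find entities that carry at least one claim with the given predicate."""
    rows = conn.execute(
        "SELECT DISTINCT subject FROM claims WHERE predicate = ? ORDER BY subject",
        (property_id,),
    )
    return [row[0] for row in rows]


conn = build_claim_cache([
    ("<P15>", "skos:prefLabel", '"work depicts or contains"@en'),
    ("<P15>", "skos:altLabel", '"depicts"@en'),
    ("<P15>", "wdt:P30", "<P14>"),
    ("<P14>", "skos:prefLabel", '"appears in work"@en'),
])

print(subjects_with_label(conn, "depicts"))
print(subjects_using_property(conn, "wdt:P30"))
```

The point of the sketch is only that the source format of an Entity (JSON or Turtle) stops mattering once its claims are in such a table; the query cost depends on the cache, not on the page text.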