Convert and Publish the Wikimedia Event Stream
- 12 slices
- 15 minutes
- 65 minutes
Ingredients
- A Wikimedia Event Stream, e.g., recentchange
- A Text Editor
- An RML Mapping Engine
- A Domain Ontology for the Stream Content
- Knowledge About OWL and RDF
- [Optional] an RSP engine to calculate some descriptive statistics
Directions
- Select one of the Wikimedia Event Streams, e.g., recentchange
- Identify the publication case (see Figure 1)
- Find the schema that is used for the messages, if any
- Find or design an ontology to be used for the conversion into an RDF Stream
- Use the Text Editor to map the schema onto a common ontology
- Using a mapping engine, apply the mapping on the fly
Figure 1
Figure 1 shows the situations practitioners might find themselves in when they want to publish Web Streams. The lower-right quadrant identifies our ultimate goal, i.e., Streaming Linked Data. The other three quadrants present the possible starting points, i.e., (upper-left) Web Data published in batches; (upper-right) Linked Data published in batches; and (lower-left) Web Data published as streams.
The Wikimedia Event Stream falls in the lower-left quadrant, i.e., it is a Web Stream that is not linked yet.
To create a Linked Data Stream from it, we follow the publication pipeline shown in Figure 2.
Figure 2
We collected the information about the stream schemas on GitHub [1]. We also assume that an OWL 2 ontology capturing the semantics of the WES domain is available. Notably, WESs are designed around the notion of event, so reasonable vocabularies for annotating the data already exist, e.g., the Event Ontology [2].
The following listing shows an example of a WES recentchange data item. As the listing shows, data items are timestamped individually. We use this timestamp to name the graph that contains all the event data. For the recentchange stream, we pay particular attention to modeling the event types when mapping into RDF, i.e., "edit", "new", "log", "categorize", or "external". Similarly, we take into account what can be represented as external resources, such as Wikidata entities.
{
  "event": "message",
  "id": [
    {
      "topic": "eqiad.mediawiki.recentchange",
      "partition": 0,
      "timestamp": 1576599002001
    },
    {
      "topic": "codfw.mediawiki.recentchange",
      "partition": 0,
      "offset": -1
    }
  ],
  "data": {
    "$schema": "/mediawiki/recentchange/1.0.0",
    "meta": {
      "uri": "https://www.wikidata.org/wiki/Q78902990",
      "request_id": "Xfj92gpAAK4AAG77O64AAABP",
      "id": "a99cd5d4-981a-42af-9947-f065d1ee28bb",
      "dt": "2019-12-17T16:10:02Z",
      "domain": "www.wikidata.org",
      "stream": "mediawiki.recentchange",
      "topic": "eqiad.mediawiki.recentchange",
      "partition": 0,
      "offset": 2039616376
    },
    "type": "log",
    "namespace": 0,
    "title": "Q78902990",
    "comment": "",
    "timestamp": 1576599002,
    "user": "Alicia Fagerving (WMSE)",
    "bot": false,
    "log_id": 0,
    "log_type": "abusefilter",
    "log_action": "hit",
    "log_params": {
      "action": "edit",
      "filter": 64,
      "actions": "",
      "log": 10906339
    },
    "log_action_comment": "Alicia Fagerving (WMSE) triggered [[Special:AbuseFilter/64|filter 64]], performing the action \"edit\" on [[Q78902990]]. Actions taken: none ([[Special:AbuseLog/10906339|details]])",
    "server_url": "https://www.wikidata.org",
    "server_name": "www.wikidata.org",
    "server_script_path": "/w",
    "wiki": "wikidatawiki",
    "parsedcomment": ""
  }
}
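As a rough illustration of the fields discussed above, the following Python sketch pulls the event identifier, type, and timestamp out of a trimmed copy of the item. The helper name `summarize` and the choice of fields are our own assumptions for illustration; they are not part of the Wikimedia schema or of the mapping.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the data item above; only the fields used below are kept.
ITEM = json.loads("""
{
  "data": {
    "meta": {
      "uri": "https://www.wikidata.org/wiki/Q78902990",
      "id": "a99cd5d4-981a-42af-9947-f065d1ee28bb"
    },
    "type": "log",
    "timestamp": 1576599002
  }
}
""")

# The event types named in the text.
KNOWN_TYPES = {"edit", "new", "log", "categorize", "external"}


def summarize(item):
    """Hypothetical helper: collect the fields the conversion relies on."""
    data = item["data"]
    event_type = data["type"]
    if event_type not in KNOWN_TYPES:
        raise ValueError(f"unexpected event type: {event_type}")
    return {
        "id": data["meta"]["id"],
        "type": event_type,
        "timestamp": data["timestamp"],  # epoch seconds, used to name the graph
        "time": datetime.fromtimestamp(data["timestamp"], tz=timezone.utc).isoformat(),
        "resource": data["meta"]["uri"],  # candidate external resource (Wikidata)
    }


summary = summarize(ITEM)
print(summary)
```

The check against `KNOWN_TYPES` is a design choice of the sketch: failing early on an unknown type keeps unmodeled events out of the RDF stream.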
The following listing shows a portion of the RML mapping, with a JSON source, that we used for the conversion. In the subject map, rr:graphMap names the RDF graph that contains all the triples after the event timestamp. The predicate-object map then adds the event type, using rdf:type and the "type" field of the JSON. The full mapping is available here.
<WMM> a rr:TriplesMap ;
    rml:logicalSource <source> ;
    rr:subjectMap [
        rr:template "http://www.wikimedia.org/es/{id}" ;
        rr:graphMap [ rr:template "http://wiki.time.com/{timestamp}" ] ] ;
    rr:predicateObjectMap [
        rr:predicate rdf:type ;
        rr:objectMap [ rr:template "http://....org/es/voc/{type}" ] ] [...] .
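To make the effect of the templates more concrete, here is a minimal Python sketch that expands them by hand for one record and prints the resulting quad in N-Quads syntax. It is not how an RML engine is implemented; it only mimics the template expansion for this one triple. The vocabulary base `http://example.org/es/voc/` is a placeholder we made up, because the real one is elided in the listing, and we assume the record has already been flattened so that `id`, `type`, and `timestamp` are top-level keys.

```python
import re
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUBJECT_TEMPLATE = "http://www.wikimedia.org/es/{id}"
GRAPH_TEMPLATE = "http://wiki.time.com/{timestamp}"
# Placeholder (assumption): the actual vocabulary IRI is elided in the mapping.
TYPE_TEMPLATE = "http://example.org/es/voc/{type}"


def expand(template, record):
    """Fill each {field} placeholder with the percent-encoded record value."""
    return re.sub(
        r"\{(\w+)\}",
        lambda match: quote(str(record[match.group(1)]), safe=""),
        template,
    )


def to_nquad(record):
    """Return the rdf:type statement for one record as an N-Quads line."""
    subject = expand(SUBJECT_TEMPLATE, record)
    obj = expand(TYPE_TEMPLATE, record)
    graph = expand(GRAPH_TEMPLATE, record)
    return f"<{subject}> <{RDF_TYPE}> <{obj}> <{graph}> ."


record = {
    "id": "a99cd5d4-981a-42af-9947-f065d1ee28bb",
    "type": "log",
    "timestamp": 1576599002,
}
quad = to_nquad(record)
print(quad)
```

Percent-encoding the values mirrors what R2RML templates do when they generate IRIs, so a value containing a space or a slash would not break the resulting IRI.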
To apply the mappings we used a modified version of CARML that handles the annotation process incrementally to minimize the translation latency.
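The idea of mapping each item as soon as it arrives can be sketched in plain Python. This is not the modified CARML engine; it is an illustration that assumes the stream is delivered as Server-Sent Events (text lines with `event:` and `data:` fields, with a blank line closing each event). The function names `sse_events` and `map_incrementally` are hypothetical.

```python
import json


def sse_events(lines):
    """Yield one dict of fields per server-sent event from an iterable of lines."""
    fields = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line closes the current event.
            if "data" in fields:
                yield fields
            fields = {}
            continue
        if line.startswith(":"):
            continue  # comment line, e.g. a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if name == "data" and "data" in fields:
            # Repeated data lines belong to the same payload.
            fields["data"] += "\n" + value
        else:
            fields[name] = value


def map_incrementally(lines, mapper):
    """Apply mapper to each item as it arrives, without buffering the stream."""
    for event in sse_events(lines):
        yield mapper(json.loads(event["data"]))


SAMPLE = [
    "event: message\n",
    'data: {"type": "log", "timestamp": 1576599002}\n',
    "\n",
    ": keep-alive\n",
    "event: message\n",
    'data: {"type": "edit", "timestamp": 1576599003}\n',
    "\n",
]
results = list(map_incrementally(SAMPLE, lambda d: (d["type"], d["timestamp"])))
print(results)
```

Because both functions are generators, each item is translated before the next one is read, which is the property that keeps the translation latency low.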
To publish WES RDF Streams, we decided to use the TripleWave approach, i.e., we separate the stream description from the stream content.
Listing 3 shows a VoCaLS description for the recentchanges stream. We included a license that is compliant with the Wikimedia terms of use. Using rdfs:seeAlso, we linked to our ontology, the mapping file, and any other relevant metadata. Due to lack of space, we did not link to the original sources. However, it would be worthwhile to create a vocals:StreamEndpoint that allows tracking the provenance of the conversion.
<recentchanges> a vocals:StreamDescriptor ; dcat:dataset <wesRCStream> .
<wesRCStream> a vocals:RDFStream ;
    dcat:title "Wikimedia Recentchanges Event Stream"^^xsd:string ;
    dcat:publisher <http://www.streamreasoning.org> ;
    dcat:license <https://creativecommons.org/licenses/by-nc/4.0/> ;
    rdfs:seeAlso <http://...mappings.ttl> ;
    rdfs:seeAlso <http://...org/wikimediavocab.owl> ;
    vocals:hasEndpoint [ a vocals:StreamEndpoint ;
        dcat:format frmt:JSON-LD ;
        dcat:accessURL "ws://.../recentchanges" ] .
In the S-GRAPH, we included a license compatible with the one from WES, and we made the VoCaLS description available as an S-GRAPH via a REST API. We included a Stream Endpoint that allows consuming the data directly over a WebSocket. The data are originally shared in a document format with a rich schema. Therefore, to preserve the level of granularity, we opted for a graph-based stream data model.
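A graph-based stream element can be pictured as a timestamped named graph serialized as JSON-LD. The Python sketch below builds one such element for a single event. The exact shape is an assumption on our side (a graph IRI as `@id`, a PROV generation time, and the event triples under `@graph`), and the vocabulary base `http://example.org/es/voc/` is a placeholder because the real one is elided in the mapping.

```python
import json
from datetime import datetime, timezone

PROV_GENERATED = "http://www.w3.org/ns/prov#generatedAtTime"
# Placeholder (assumption): the actual vocabulary IRI is elided in the mapping.
VOCAB_BASE = "http://example.org/es/voc/"


def as_stream_element(record):
    """Wrap one event in a named graph identified by the event timestamp."""
    generated = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return {
        "@id": f"http://wiki.time.com/{record['timestamp']}",
        PROV_GENERATED: generated.isoformat(),
        "@graph": [
            {
                "@id": f"http://www.wikimedia.org/es/{record['id']}",
                "@type": VOCAB_BASE + record["type"],
            }
        ],
    }


record = {
    "id": "a99cd5d4-981a-42af-9947-f065d1ee28bb",
    "type": "log",
    "timestamp": 1576599002,
}
element = as_stream_element(record)
print(json.dumps(element, indent=2))
```

Keeping all the triples of an event in one graph means a consumer receives the whole event at once, which is how the granularity of the original document format is preserved.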
Last but not least, we can compute some descriptive statistics over the stream. For instance, the following listing shows an RSP-QL query that calculates the stream rate, in triples per second, every minute.
REGISTER RSTREAM <outputstream> AS
SELECT (COUNT(*)/60 AS ?ratesec)
FROM NAMED WINDOW <win> ON <http://wikimedia.org/recentchanges/rdf> [RANGE PT60S STEP PT60S]
WHERE { WINDOW <win> { ?s ?p ?o } }
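The arithmetic behind the query above can be checked offline with a few lines of Python. The sketch assumes we have the arrival time, in epoch seconds, of every counted item (every triple, in the case of the query); it groups them into 60-second tumbling windows and divides each count by the window width. The function name `rates_per_window` is hypothetical, and the window alignment to multiples of 60 is a simplification that an RSP engine may not share.

```python
from collections import Counter

WINDOW_SECONDS = 60


def rates_per_window(timestamps, width=WINDOW_SECONDS):
    """Map each window start (epoch seconds) to the average items per second."""
    counts = Counter((ts // width) * width for ts in timestamps)
    return {start: count / width for start, count in sorted(counts.items())}


# 90 items in the first minute, 30 in the next one.
sample = [1576599002] * 90 + [1576599065] * 30
rates = rates_per_window(sample)
print(rates)
```

Since the range and the step of the window are equal, every item falls into exactly one window, so the per-window counts add up to the total number of items.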