2013-01-08

The Semantic Web is an Innumerable Corpus

I recently had the opportunity to present my PhD to people who are knowledgeable in related fields (particularly CYC), and I noticed that their questions usually stemmed from my not properly explaining one property of the Semantic Web, which I call the "Innumerable Corpus".

The Innumerable Corpus is defined as: innumerable triplets describing innumerable subjects expressed using innumerable ontologies. The word innumerable means a "practical infinity", that is, something that is not infinite but is too large to count in practice. This means that no single source or ontology should be considered authoritative. There is no guarantee that information of the same type (e.g. people), even from the same source, will use exactly the same ontologies. A general purpose semweb browser must therefore adapt to RDF data from any source.

The inherent instability in the triplets for a subject means that thinking in terms of looser "predicate patterns" is perhaps more useful than expecting conformance to ontologies.
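To make the idea of a "predicate pattern" a little more concrete, here is a minimal sketch. It assumes triplets are plain (subject, predicate, object) tuples and that a pattern is a set of named slots, each of which can be filled by any one of several predicates. The sample data, the slot names and the function names are all hypothetical and are only there for illustration.

```python
from collections import defaultdict

# Hypothetical sample triplets drawn from more than one vocabulary.
TRIPLES = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:bob", "schema:name", "Bob"),
    ("ex:bob", "schema:email", "bob@example.org"),
    ("ex:bob", "schema:birthDate", "1980-01-01"),
    ("ex:doc", "dc:title", "A Report"),
]

# A loose pattern: each slot accepts any one of several predicates,
# so a subject does not have to conform to a single ontology.
CONTACT_PATTERN = {
    "name": {"foaf:name", "schema:name", "rdfs:label"},
    "email": {"foaf:mbox", "schema:email"},
}


def predicates_by_subject(triples):
    """Index the set of predicates used by each subject."""
    index = defaultdict(set)
    for subject, predicate, _ in triples:
        index[subject].add(predicate)
    return index


def matches(pattern, predicates):
    """True when every slot is filled by at least one of its alternatives."""
    return all(alternatives & predicates for alternatives in pattern.values())


def subjects_matching(pattern, triples):
    """Return, in sorted order, the subjects whose predicates fit the pattern."""
    index = predicates_by_subject(triples)
    return sorted(s for s, preds in index.items() if matches(pattern, preds))


print(subjects_matching(CONTACT_PATTERN, TRIPLES))
```

Both ex:alice and ex:bob fit the pattern even though they share no predicates, and extra predicates such as schema:birthDate do not prevent a match. That tolerance is the point of a pattern as opposed to a strict schema.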

Many semweb projects scope the Innumerable Corpus property out. Some projects use RDF as a data transmission language for data that comes from controlled sources and is expressed in well-defined ontologies. In that context a pragmatic approach is to hand-produce the displays depending on rdf:type. By ignoring the Innumerable Corpus property it is also possible to limit a project to a small subset of semweb data, store it locally for speed of access and perform computationally expensive processing over that data. Inferences can be pre-computed before a user needs them.

All research projects have to scope things out for practical reasons, but personally I find the Innumerable Corpus to be the property of the semweb that interests me most. The political argument is an important one: who controls how our knowledge is defined? Sure, authoritative information sources expressed in well known ontologies are important to our shared understanding, but what about murky knowledge on the fringes? What about knowledge that is in dispute? What about knowledge that does not easily fit the orthodox ontologies? The decentralized possibilities of the semantic web really struck a chord with the cyberpunk principles I grew up with.

There is also a technical reason. Semweb browsers will eventually elaborate upon a subject by aggregating and inferring over RDF data from multiple sources. The greater the expansion, the greater the probability that subjects of the same type will have less consistent predicate sets, particularly as owl:sameAs links are followed and rdf:type definitions expand. The number of triplets per subject could be very high due to this data expansion. Without filtering, grouping and ordering, the display of all those triplets will cause information overload.
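The expansion and the need for grouping and ordering can be sketched in a few lines. This is only an illustration under simplifying assumptions: triplets are plain tuples already fetched into memory, owl:sameAs is treated as symmetric and followed transitively, and the `priority` argument is a hypothetical stand-in for whatever ordering policy a browser would really use.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"

# Hypothetical triplets about one person, spread over three sources.
TRIPLES = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "owl:sameAs", "dbp:Alice"),
    ("dbp:Alice", "rdfs:label", "Alice"),
    ("dbp:Alice", "dbo:birthPlace", "dbp:London"),
    ("other:a1", "owl:sameAs", "dbp:Alice"),
    ("other:a1", "foaf:name", "Alice L."),
    ("ex:bob", "foaf:name", "Bob"),
]


def aliases(subject, triples):
    """Collect every identifier reachable from subject via owl:sameAs links."""
    links = defaultdict(set)
    for s, p, o in triples:
        if p == SAME_AS:
            links[s].add(o)
            links[o].add(s)  # treat the link as symmetric
    seen, stack = {subject}, [subject]
    while stack:
        for other in links[stack.pop()]:
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


def grouped_view(subject, triples, priority=()):
    """Merge triplets for all aliases, group by predicate, then order the groups."""
    names = aliases(subject, triples)
    groups = defaultdict(set)
    for s, p, o in triples:
        if s in names and p != SAME_AS:
            groups[p].add(o)
    rank = {p: i for i, p in enumerate(priority)}
    # Prioritised predicates come first; the rest follow alphabetically.
    ordered = sorted(groups, key=lambda p: (rank.get(p, len(rank)), p))
    return [(p, sorted(groups[p])) for p in ordered]


for predicate, values in grouped_view("ex:alice", TRIPLES, ("foaf:name", "rdfs:label")):
    print(predicate, values)
```

Starting from ex:alice, the view picks up triplets published under dbp:Alice and other:a1 as well, which is exactly how the predicate set for a subject grows and becomes less predictable as more links are followed.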

The Innumerable Corpus means that each subject requires a custom-produced display of data. This could be derived from how other similar (by rdf:type) subjects are displayed. These custom displays must be produced at runtime.
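One way to picture deriving a display from similar subjects is the sketch below. It assumes, purely for illustration, that "how similar subjects are displayed" can be approximated by how often each predicate is used among subjects sharing an rdf:type with the target. The data and the `display_order` function are hypothetical.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical triplets; ex:carol carries a predicate nobody else uses.
TRIPLES = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:bob", "foaf:homepage", "http://example.org/bob"),
    ("ex:carol", "rdf:type", "foaf:Person"),
    ("ex:carol", "foaf:homepage", "http://example.org/carol"),
    ("ex:carol", "ex:shoeSize", "38"),
    ("ex:carol", "foaf:name", "Carol"),
    ("ex:report", "rdf:type", "foaf:Document"),
    ("ex:report", "dc:title", "A Report"),
]


def display_order(subject, triples):
    """Order a subject's predicates by how common they are among its type peers."""
    types = defaultdict(set)
    preds = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            preds[s].add(p)
    own_types = types.get(subject, set())
    # Peers are other subjects that share at least one rdf:type with the target.
    peers = [s for s, ts in types.items() if s != subject and ts & own_types]
    usage = Counter(p for peer in peers for p in preds.get(peer, ()))
    # Most widely used predicates first; unknown ones last, alphabetically.
    return sorted(preds.get(subject, ()), key=lambda p: (-usage[p], p))


print(display_order("ex:carol", TRIPLES))
```

The order is computed when the subject is viewed rather than fixed in advance, so the unusual ex:shoeSize predicate is still shown, just after the predicates that peers use more often.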

Producing an authoritative display for an rdf:type is inherently fragile because it is difficult to account for missing predicates and additional predicates that do not conform to that display. Even the concept of an authoritative display assumes a single centrally ordained way of showing things. My preferred approach is to use personalization to let users negotiate with the semweb browser as to how they would prefer to see the data.
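As a rough sketch of what that negotiation could look like, the example below layers a user's preferences over whatever default order the browser proposes. The `Preferences` class, its `pinned` and `hidden` fields and the `personalize` function are hypothetical names invented for this illustration, not part of any existing browser.

```python
from dataclasses import dataclass, field


@dataclass
class Preferences:
    """Hypothetical per-user display preferences."""
    pinned: list = field(default_factory=list)  # predicates to show first, in order
    hidden: set = field(default_factory=set)    # predicates never to show


def personalize(default_order, prefs):
    """Apply user preferences on top of the browser's proposed predicate order."""
    visible = [p for p in default_order if p not in prefs.hidden]
    # Only pin predicates that this subject actually has and that are not hidden.
    pinned = [p for p in prefs.pinned if p in visible]
    rest = [p for p in visible if p not in pinned]
    return pinned + rest


proposed = ["foaf:name", "foaf:homepage", "ex:shoeSize"]
prefs = Preferences(pinned=["ex:shoeSize", "foaf:mbox"], hidden={"foaf:homepage"})
print(personalize(proposed, prefs))
```

Preferences that mention predicates a subject lacks (foaf:mbox here) are simply skipped, and predicates the user has never seen before still appear after the pinned ones, so missing and additional predicates do not break the display.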

Plurality, instability and inconsistency are not special cases; the Innumerable Corpus is about accepting them as normal and good.