2012-08-28

PhD Progress Report to August 2012

It's been awhile since I have posted on this topic so here's what I've been doing with my PhD. I have since made several improvement to the semantic web browser described in my last PhD update and have been experimenting with techniques to reduce information overload.

Reminder: My research tries to reduce information overload when displaying data where the structure of that data is not known ahead of time.

There is a good body of research in linking determining how related text strings are by meaning (the linguistics meaning of semantics). The problem for me is that the speed is not fast enough for building displays at runtime in a web-browser. I instead went with lexical methods that use naïve rules to just compare characters in a label. There are well establish algorithms for this; Dice, Levenshtein to name a two. My implementation of Dice in Javascript is now in the WikiBook project called Algorithm Implementation

What is not immediately clear is how these algorithms should determine similarity when two triplets with labels resolved are compared. Assuming that the subject is equal/equivalent then just how does x->firstname:John relate to x-->surname:Xeedown? My most recent chapter took Dice and Levenshtein string similarity algorithms and then compared an averaging of the predicate and object similarities versus simply taking the higher of the predicate or object similarities. I conducted a study with twenty human participants and have analysed the data. The data is indicating that a naïve lexical algorithm can give acceptable (i.e. “usefully better than random”) results in line with what human participants would say about the same triplet pairs. This is encouraging.

In that study I also tested any algorithm for determining if triplets were redundant. This is used to subsume (not display) triplets when it is deemed that other triplets contain equivalent information. I already had built such an algorithm using intuition and voodoo – and the testing on humans indicates that the algorithm is usefully better than random at matching what humans also rank as redundant data. This is also encouraging.

Using the redundancy algorithms I now manage to avoid displaying quite a number of triplets. In some cases, using data from dbpedia, about 55% of triplets are simply not displayed because they contain redundant information or are turned into lists.

I also used the similarity algorithms (actually the inverse) as the distance metric in hierarchical clustering of triplets. The cluster hierarchy is then flattened which results in groups of related triplets. While not giving perfect results, for a “first naïve attempt” at grouping triplets the results appear to be better than random – though I have not tested this.

I am currently writing up research results and learning tons about statistics as I go. It is particularly interesting to read about the debates in statistics for multi-rater agreement in Likert scales. After trying to come to grips with so much math I eventually went with the old method of just just looking at the shape of the of raters histogram.

Going forward from here is the next chapter. This one leaps off from the “first attempt” approaches described above and attempts to find algorithms that learn the user's preferences for ordering, grouping and redundancy. Recommender systems research has some particularly interesting avenues to explore here. I currently invisage a conversational User Interface that allows the user to express their data display preferences (order, grouping, redundancy etc) via direct manipulation.

The performance of any algorithms found and adjusted will then be tested against human raters. Once that study is complete I then theoretically switch into “write up” mode for the thesis but will need the ocassional coding distraction of actually implementing the results of the studies into Eme. Things are looking promising.