Triple Scoring

Knowledge base queries typically produce a list of entities. For reasons similar to those in full-text search, it is usually desirable to rank these entities. A basic ingredient of such a ranking is a relevance score for each individual triple.

Page last updated on 09-01-2017: the submission deadline is over and the test data is now available for download; see the section "Output / Test data" below.

Task

Given a triple from a "type-like" relation, compute a score that measures the relevance of the statement expressed by the triple compared to other triples from the same relation.

Note: read on to understand the emphasis on "type-like" relations. In a nutshell, these are the relations for which relevance scores are needed most. The task focuses on two such relations: "profession" and "nationality".

Awards

The three best-performing approaches submitted by eligible participants, as measured by the performance measures used for this task, will receive the following awards, kindly sponsored by Adobe Systems, Inc.:

  1. $1500 for the best-performing approach,
  2. $750 for the second best-performing approach, and
  3. $500 for the third best-performing approach.

Task Rules

You are free to use all of the data provided in the next section, but you do not have to use all of it, and you may use any kind or amount of other data as well.

You are also free to use an arbitrary amount of computation.

However, you should not generate or make use of large amounts of human judgements beyond those provided in the .train files in the next section.

Input / Training data

We provide the following text files. You can just click on the link and look at the file in your browser. At the end of the list is a link to a ZIP archive containing all the files. Below the list we provide some more explanations.

Note: some of the filenames were changed slightly on 16-09-2016. The contents of the files are still exactly the same; we think the new file names are clearer.

profession.kb   all professions for a set of 343,329 persons
profession.train   relevance scores for 515 tuples (pertaining to 134 persons) from profession.kb
nationality.kb   all nationalities for a set of 301,590 persons
nationality.train   relevance scores for 162 tuples (pertaining to 77 persons) from nationality.kb
professions   the 200 different professions from profession.kb (for your convenience)
nationalities   the 100 different nationalities from nationality.kb (for your convenience)
persons   385,426 different person names from the two .kb files and their Freebase ids (for your convenience)
wiki-sentences   33,159,353 sentences from Wikipedia with annotations of these 385,426 persons (can but does not have to be used)

triple-scoring.zip   a ZIP file containing all of the files above (1.5 GB compressed, 4.2 GB uncompressed)

Some more explanations:

  • The two .kb files were extracted from a 14-04-2014 dump of Freebase. This is not important for the task; we mention it just in case you were curious.
  • The training sets (the .train files provided above) contain only tuples from the respective .kb files. The same will hold true for the test sets (provided after the submission deadline, and on which your submission will be evaluated).
  • When working on the task you will realize that the two training sets are not sufficient on their own, but that you need additional data. In particular, there will be professions / nationalities in the test set for which there is no tuple in the training set.
  • The wiki-sentences are just one example of such additional data, provided above to make it easier for you to get started. Feel free to use any other data instead or in addition. The only thing you are not allowed to use is additional training data generated from human judgement.
  • We limited the set of professions / nationalities to 200 / 100 to make the task feasible for you, since you probably want to learn something for each profession / nationality.
  • The contents of the files professions and nationalities are redundant; they are provided just for your convenience. They contain exactly the sets of distinct professions / nationalities from the second column of the two .kb files.
  • The file persons contains a few person names that occur in neither of the two .kb files. This does no harm, though.
  • The person names are exactly the names used by the English Wikipedia. That is, http://en.wikipedia.org/wiki/<person name> takes you to the respective Wikipedia page.
  • The Freebase ids provided in the persons file might be useful if you want to work with a dataset like FACC1 (which is analogous to the wiki-sentences provided above, but for ClueWeb instead of Wikipedia). You do not have to use them, though.
  • For each of the names in persons, there are sentences in wiki-sentences (68,662 sentences for the most frequently mentioned person, 3 sentences for the least frequently mentioned person).
  • As mentioned in the task rules above: feel free to use the provided data, but feel equally free to use any kind or amount of additional data (except for human judgements for the person-profession and person-nationality pairs in the .kb files).
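
For concreteness, here is a minimal sketch of how the .kb and .train files can be read. It assumes the tab-separated format described above: two columns (person, profession or nationality) in the .kb files, plus a third column with an integer score in 0..7 in the .train files. The function names are only illustrative and not part of the provided code.

    def read_kb(path):
        """Read a .kb file: one tab-separated (person, value) pair per line."""
        pairs = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                person, value = line.rstrip("\n").split("\t")
                pairs.append((person, value))
        return pairs

    def read_train(path):
        """Read a .train file: tab-separated (person, value, score) triples,
        with an integer score in the range 0..7."""
        triples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                person, value, score = line.rstrip("\n").split("\t")
                triples.append((person, value, int(score)))
        return triples

    # Example usage with the file names from the list above:
    # profession_kb = read_kb("profession.kb")
    # profession_train = read_train("profession.train")
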
Output / Test data

Your software will be evaluated on two test sets (one for professions and one for nationalities) of exactly the same nature as the two training sets (the .train files) above. The test sets will be subsets of the .kb files above, but with scores like in the .train files.

Your software should produce output exactly like in the .train files above. That is, given a test file, append an additional column (tab-separated, like for all files in this task) with the score, which should be an integer from the range 0..7.

Your software has to figure out whether it is being fed the test file with professions or nationalities (see the section below for the command line call). It can tell this from the base of the file name, that is, the part before the first dot. The base names of the test sets will be profession and nationality, just as for the training sets above.
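
A minimal sketch of this base-name check (the function name is only illustrative):

    import os

    def relation_of(input_path):
        """Return "profession" or "nationality", taken from the part of the
        file name before the first dot (e.g. ".../profession.test")."""
        base = os.path.basename(input_path).split(".")[0]
        if base not in ("profession", "nationality"):
            raise ValueError("unexpected input file name: " + input_path)
        return base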

Here is the script that we will use for the evaluation, and that (of course) you can use, too:

evaluator.py

It is written in Python 3. You get a short usage info with python evaluator.py -h, and a longer explanation in the comment at the beginning of the script. The script also checks whether the formatting of the input files is correct and, if not, tells you how and where it is not. The three measures evaluated are explained in the next section.

Update 08-11-2016: the script can now also be used to evaluate multiple run-truth pairs (in particular, for a joint evaluation of your performance on the profession and nationality test sets, as will be done after the submission deadline). The numbers are then computed for the union of the pairs, that is, as if all the run files and all the truth files had been concatenated. Note that you can still run the script for a single run-truth pair as before.

Update 09-01-2017: the submission deadline is over and the test data is now public:

profession.test   relevance scores for 513 tuples (pertaining to 134 persons) from profession.kb (see above)
nationality.test   relevance scores for 197 tuples (pertaining to 96 persons) from nationality.kb (see above)

Performance Measures

The scores in the train and test files have been obtained via crowdsourcing. Each tuple (<person> <profession> or <person> <nationality>) has been judged by 7 human judges. Each judgement is binary: primarily relevant (= 1) or secondarily relevant (= 0). Note that all our tuples are "correct", so there is no category "irrelevant" here (in the rare case that a tuple is incorrect, judges will label it 0). The 7 judgements per triple are added up, which gives an integer score in the range 0..7. For example, if 5 of the 7 judges consider a tuple primarily relevant, its score is 5.

We evaluate three relevance measures, two score-based and one rank-based:

Average score difference: for each triple, take the absolute difference of the relevance score computed by your system and the score from the ground truth; add up these differences and divide by the number of triples.

Accuracy: the percentage of triples for which the score computed by your system differs from the score from the ground truth by at most 2.

Kendall's Tau: for each relation and each subject, rank all triples with that subject and relation once by the scores computed by your system and once by the scores from the ground truth. Compare the two rankings using Kendall's Tau. See the (well-documented) code of the evaluator.py script above for how ties are handled.
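
The evaluator.py script above is the authoritative implementation of these measures. The following is only a rough sketch to illustrate them, assuming that run and truth are lists of (subject, value, score) triples in the same order, and using SciPy's Kendall's Tau with its default tie handling (which may differ in detail from the official script, in particular for subjects where all scores are equal):

    from collections import defaultdict
    from scipy.stats import kendalltau

    def average_score_difference(run, truth):
        """Mean absolute difference between system and ground-truth scores."""
        diffs = [abs(r[2] - t[2]) for r, t in zip(run, truth)]
        return sum(diffs) / len(diffs)

    def accuracy(run, truth):
        """Fraction of triples whose system score is within 2 of the ground truth."""
        hits = [abs(r[2] - t[2]) <= 2 for r, t in zip(run, truth)]
        return sum(hits) / len(hits)

    def average_kendall_tau(run, truth):
        """Average, over all subjects, of Kendall's Tau between the per-subject
        rankings induced by the system scores and by the ground-truth scores.
        Note: SciPy returns NaN for constant or single-element score lists; the
        official evaluator.py documents how such ties are actually handled."""
        by_subject = defaultdict(lambda: ([], []))
        for (subj, _, r_score), (_, _, t_score) in zip(run, truth):
            by_subject[subj][0].append(r_score)
            by_subject[subj][1].append(t_score)
        taus = [kendalltau(r, t)[0] for r, t in by_subject.values()]
        return sum(taus) / len(taus)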

More details on the crowdsourcing task used to obtain the ground truth scores, on the performance measures, and on a number of baselines for solving the task can be found in the SIGIR paper cited in the "Related Work" section below.

The award will go to the system/team that achieves the highest accuracy on the combination of both test sets (profession and nationality). In our final report on the competition, we will report results for all three performance measures.

Submission

We ask you to prepare your software so that it can be executed via a command line call.

 > mySoftware -i path/to/input/file -o path/to/output/directory

The name of the output file (to be written to the path/to/output/directory folder) must be the same as the name of the input file. There can be more than one -i argument; in that case your software should process each of the inputs and produce one output file for each.

For example, if your software is called like this:

 > mySoftware -i /dataset/profession.test -i /dataset/nationality.test -o /output

it should write the files profession.test and nationality.test to the folder /output; the two output files should be identical to the two input files, except that they contain an additional column with the scores (integers from the range 0..7).
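
A minimal skeleton of such a command line interface, in Python 3, could look as follows. The predict_score function is a placeholder for your actual scoring approach; everything else just handles the input and output conventions described above.

    import argparse
    import os

    def predict_score(person, value, relation):
        """Placeholder: return an integer score in 0..7 for the given tuple."""
        return 4  # replace with your actual model

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("-i", dest="inputs", action="append", required=True,
                            help="input file (may be given more than once)")
        parser.add_argument("-o", dest="output_dir", required=True,
                            help="output directory")
        args = parser.parse_args()

        for input_path in args.inputs:
            # The relation ("profession" or "nationality") is the part of the
            # file name before the first dot.
            relation = os.path.basename(input_path).split(".")[0]
            output_path = os.path.join(args.output_dir, os.path.basename(input_path))
            with open(input_path, encoding="utf-8") as fin, \
                 open(output_path, "w", encoding="utf-8") as fout:
                for line in fin:
                    person, value = line.rstrip("\n").split("\t")[:2]
                    score = predict_score(person, value, relation)
                    # Reproduce the input line and append the score as a new column.
                    fout.write(line.rstrip("\n") + "\t" + str(score) + "\n")

    if __name__ == "__main__":
        main()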

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

Virtual Machine User Guide »

Once your software is deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate it on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the WSDM Cup 2017. We agree not to share your software with a third party or use it for other purposes than the WSDM Cup 2017.

Related Work

Hannah Bast, Björn Buchhold, and Elmar Haußmann. Relevance Scores for Triples from Type-Like Relations. In SIGIR 2015: 243 -- 252.

Hannah Bast, Björn Buchhold, and Elmar Haußmann. Semantic Search on Text and Knowledge Bases. In FnTIR 10(2-3): 119 -- 271 (2016).

Task Chairs

Hannah Bast

University of Freiburg

Björn Buchhold

University of Freiburg

Elmar Haußmann

University of Freiburg