Knowledge base queries typically produce a list of entities. For reasons similar to those in full-text search, it is usually desirable to rank these entities. A basic ingredient of such a ranking is relevance scores for individual triples.
Page last updated on 09-01-2017: the submission deadline is over and the test data is now available for download, see section "Output / Test data" below.
Given a triple from a "type-like" relation, compute a score that measures the relevance of the statement expressed by the triple compared to other triples from the same relation.
Note: read on to understand the emphasis on "type-like" relations. In a nutshell, these are the relations for which relevance scores are needed most. The task focuses on two such relations: "profession" and "nationality".
The three best-performing approaches submitted by eligible participants as per the performance measures used for this task will receive the following awards, kindly sponsored by Adobe Systems, Inc.:
You are free to use all of the data provided in the next section, but you do not have to use all of it, and you may use any kind or amount of other data as well.
You are also free to use an arbitrary amount of computation.
However, you should not generate or make use of large amounts of human judgements in addition to the ones provided in the .train files in the next section.
We provide the following text files. You can just click on the link and look at the file in your browser. At the end of the list is a link to a ZIP archive containing all the files. Below the list we provide some more explanations.
Note: some of the filenames have been changed slightly on 16-09-2016. The contents of the files are still exactly the same, however. We think the new file names are clearer.
|profession.kb||all professions for a set of 343,329 persons|
|profession.train||relevance scores for 515 tuples (pertaining to 134 persons) from profession.kb|
|nationality.kb||all nationalities for a set of 301,590 persons|
|nationality.train||relevance scores for 162 tuples (pertaining to 77 persons) from nationality.kb|
|professions||the 200 different professions from profession.kb (for your convenience)|
|nationalities||the 100 different nationalities from nationality.kb (for your convenience)|
|persons||385,426 different person names from the two .kb files and their Freebase ids (for your convenience)|
|wiki-sentences||33,159,353 sentences from Wikipedia with annotations of these 385,426 persons (can but does not have to be used)|
|triple-scoring.zip||a ZIP file containing all of the files above (1.5 GB compressed, 4.2 GB uncompressed)|
Some more explanations:
Your software will be evaluated on two test sets (one for professions and one for nationalities) of exactly the same nature as the two training sets (the .train files) above. The test sets will be subsets of the .kb files above, but with scores like in the .train files.
Your software should produce an output exactly like in the .train files above. That is, given a test file, append an additional column (tab-separated, like for all files in this task) with the score, which should be an integer from the range 0..7.
Your software has to figure out whether it is being fed the test file with professions or nationalities (see the section below for the command line call). It can tell this from the base of the file name, that is, the part before the first dot. The base names of the test sets will be profession and nationality, just as for the training sets above.
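The output convention above can be sketched as follows. This is a minimal illustration, not a reference implementation: `score_triple` is a hypothetical placeholder for whatever scoring method your system uses.

```python
import os

def score_triple(subject, value, relation):
    """Hypothetical placeholder for a real scoring method.
    Must return an integer in the range 0..7."""
    return 7  # dummy: treat every triple as primarily relevant

def process_file(in_path, out_dir):
    # The relation can be told from the base of the file name,
    # i.e. the part before the first dot: "profession" or "nationality".
    base = os.path.basename(in_path)
    relation = base.split(".")[0]
    # The output file must have the same name as the input file.
    out_path = os.path.join(out_dir, base)
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            subject, value = line.rstrip("\n").split("\t")
            score = score_triple(subject, value, relation)
            # Append the score as an additional tab-separated column.
            fout.write("%s\t%s\t%d\n" % (subject, value, score))
```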
Here is the script that we will use for the evaluation, and that (of course) you can use, too:
Update 08-11-2016: the script can now also be used to evaluate multiple run-truth pairs (in particular, for a joint evaluation of your performance on the profession and nationality test set, as it will be done after the submission deadline). The numbers are then for the unions of the pairs, that is, as if all the run files and all the truth files were concatenated. Note that you can also still run the script for a single run-truth pair as before.
Update 09-01-2017: the submission deadline is over and the test data is now public:
|profession.test||relevance scores for 513 tuples (pertaining to 134 persons) from profession.kb (see above)|
|nationality.test||relevance scores for 197 tuples (pertaining to 96 persons) from nationality.kb (see above)|
The scores in the train and test files have been obtained via crowdsourcing. Each tuple (<person> <profession> or <person> <nationality>) has been judged by 7 human judges. Each judgement is binary: primarily relevant (= 1) or secondarily relevant (= 0). Note that all our tuples are "correct", so there is no category "irrelevant" here (in the rare case that a tuple is incorrect, judges will label it 0). The 7 judgements per triple are added up, which gives an integer score in the range 0..7.
We evaluate three relevance measures, two score-based and one rank-based:
Average score difference: for each triple, take the absolute difference of the relevance score computed by your system and the score from the ground truth; add up these differences and divide by the number of triples.
Accuracy: the percentage of triples for which the score computed by your system differs from the score from the ground truth by at most 2.
Kendall's Tau: for each relation, for each subject, compute the ranking of all triples with that subject and relation according to the scores computed by your system and the score from the ground truth. Compute the difference of the two rankings using Kendall's Tau. See the (well-documented) code of the evaluator.py script above for how ties are handled.
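The three measures can be sketched as follows. This is a simplified illustration only: in particular, the Kendall's Tau sketch counts the fraction of discordant pairs for a single subject and ignores the tie handling implemented (and documented) in the official evaluator.py script, which remains authoritative.

```python
from itertools import combinations

def average_score_difference(run, truth):
    """Mean absolute difference between system and ground-truth scores."""
    return sum(abs(r - t) for r, t in zip(run, truth)) / len(run)

def accuracy(run, truth):
    """Fraction of triples whose system score is within 2 of the truth."""
    return sum(abs(r - t) <= 2 for r, t in zip(run, truth)) / len(run)

def kendalls_tau(run, truth):
    """Simplified Kendall's Tau distance for one subject's triples:
    the fraction of pairs ranked in opposite order by the two
    rankings. Ties are NOT handled as in the official evaluator."""
    pairs = list(combinations(range(len(run)), 2))
    discordant = sum(1 for i, j in pairs
                     if (run[i] - run[j]) * (truth[i] - truth[j]) < 0)
    return discordant / len(pairs)
```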
More details on the crowdsourcing task used to obtain the ground truth scores, on the performance measures, and on a number of baselines for solving the task can be found in the SIGIR paper cited in the "Related Work" section below.
The award will go to the system/team that achieves the highest accuracy on the combination of both test sets (profession and nationality). In our final report about the competition, we will report results for all three performance measures.
We ask you to prepare your software so that it can be executed via a command line call.
> mySoftware -i path/to/input/file -o path/to/output/directory
The name of the output file (to be written to the path/to/output/directory folder) must be the same as the name of the input file. There can be more than one -i argument. In that case, your software should process each input and produce one output file for each.
For example, if your software is called like this:
> mySoftware -i /dataset/profession.test -i /dataset/nationality.test -o /output
it should write the files profession.test and nationality.test to the folder /output, and the two files should be identical to the two input files, except that they contain an additional column with the scores (from the integer range 0..7).
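The call convention with repeated -i arguments can be handled as in this sketch (an argparse-based illustration; you may of course use any language and any argument parser):

```python
import argparse
import os

def parse_args(argv):
    """Parse one or more -i input files and one -o output directory."""
    parser = argparse.ArgumentParser(prog="mySoftware")
    # action="append" collects every -i occurrence into a list.
    parser.add_argument("-i", dest="inputs", action="append", required=True)
    parser.add_argument("-o", dest="output_dir", required=True)
    return parser.parse_args(argv)

args = parse_args(["-i", "/dataset/profession.test",
                   "-i", "/dataset/nationality.test",
                   "-o", "/output"])
# Each input maps to an output file of the same name in the output dir.
out_files = [os.path.join(args.output_dir, os.path.basename(p))
             for p in args.inputs]
```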
You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:
Once your software is deployed on your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the WSDM Cup 2017. We agree not to share your software with a third party or use it for other purposes than the WSDM Cup 2017.
Hannah Bast, Björn Buchhold, and Elmar Haußmann. Relevance Scores for Triples from Type-Like Relations. In SIGIR 2015: 243 -- 252.
Hannah Bast, Björn Buchhold, and Elmar Haußmann. Semantic Search on Text and Knowledge Bases. In FnTIR 10(2-3): 119 -- 271 (2016).