Wikidata is the Wikimedia Foundation's new, large-scale knowledge base, which can be edited by anyone. Its knowledge is increasingly used within Wikipedia as well as in all kinds of information systems, which places high demands on its integrity. Nevertheless, Wikidata is frequently vandalized, exposing all its users to the risk of spreading vandalized and falsified information.
Given a Wikidata revision, compute a vandalism score denoting the likelihood of this revision being vandalism (or similarly damaging).
The three best-performing approaches submitted by eligible participants as per the performance measures used for this task will receive the following awards, kindly sponsored by Adobe Systems, Inc.:
Furthermore, Wikimedia Germany supports the transfer of the scientific insights gained in this task by inviting the eligible participants who submitted the best-performing approaches to visit them for a couple of days in order to work together on planning a potential integration of the approach into Wikidata.
The goal of the vandalism detection task is to detect vandalism in near real time, as soon as it happens. Hence, the following rules apply:
To develop your software, we provide you with a training corpus consisting of Wikidata revisions together with labels indicating whether each revision is considered vandalism.
The Wikidata Vandalism Corpus 2016 contains revisions of the knowledge base Wikidata. The corpus comprises manual revisions only; all revisions by official bots have been filtered out. For each revision, we indicate whether it is considered vandalism (ROLLBACK_REVERTED) or not. Unlike the Wikidata dumps, revisions are ordered chronologically by REVISION_ID (i.e., in the order they arrived at Wikidata). For training, we provide data up to February 29, 2016. The evaluation will be conducted on later data.
The provided training data consists of 23 files in total. You can check their validity via their MD5 checksums.
Name | Types | Description |
---|---|---|
REVISION_ID | Integer | The Wikidata revision id |
REVISION_SESSION_ID | Integer | The Wikidata revision id of the first revision in this session |
USER_COUNTRY_CODE | String | Country code for IP address (only available for unregistered users) |
USER_CONTINENT_CODE | String | Continent code for IP address (only available for unregistered users) |
USER_TIME_ZONE | String | Time zone for IP address (only available for unregistered users) |
USER_REGION_CODE | String | Region code for IP address (only available for unregistered users) |
USER_CITY_NAME | String | City name for IP address (only available for unregistered users) |
USER_COUNTY_NAME | String | County name for IP address (only available for unregistered users) |
REVISION_TAGS | List<String> | The Wikidata revision tags |
Name | Types | Description |
---|---|---|
REVISION_ID | Integer | The Wikidata revision id |
ROLLBACK_REVERTED | Boolean | Whether this revision was reverted via the rollback feature |
UNDO_RESTORE_REVERTED | Boolean | Whether this revision was reverted via the undo/restore feature |
The ROLLBACK_REVERTED field encodes the official ground truth for this competition. The UNDO_RESTORE_REVERTED field serves informational purposes only.
The truth file will only be available for the training dataset but not for test datasets.
The corpus can be processed, for example, with Wikidata Toolkit.
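Wikidata Toolkit is a Java library; for lightweight exploration, the meta and truth files can also be joined directly on REVISION_ID. A minimal pandas sketch using a few illustrative inline rows (the T/F encoding of the boolean fields is an assumption here, not taken from the corpus documentation):

```python
import io
import pandas as pd

# Illustrative rows in the documented column formats; the real corpus
# files are far larger and the meta file has additional columns.
meta = pd.read_csv(io.StringIO(
    "REVISION_ID,REVISION_SESSION_ID,USER_COUNTRY_CODE,REVISION_TAGS\n"
    "123,123,US,\n"
    "124,123,,\n"
))
truth = pd.read_csv(io.StringIO(
    "REVISION_ID,ROLLBACK_REVERTED,UNDO_RESTORE_REVERTED\n"
    "123,T,F\n"
    "124,F,F\n"
))

# Join features and labels on the revision id. ROLLBACK_REVERTED is the
# official ground truth; UNDO_RESTORE_REVERTED is informational only.
data = meta.merge(truth, on="REVISION_ID")
data["VANDALISM"] = data["ROLLBACK_REVERTED"] == "T"
```

Note that the geolocation columns are empty for registered users, so they should be treated as nullable when building features.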
For validating your software, we provide you with a validation dataset that encompasses the two months succeeding the training dataset. The provided validation data consists of 3 files; you can check their validity via their MD5 checksums.
For the final evaluation of submissions, we used the two months of data succeeding the validation dataset. The data was not publicly released until after the submission deadline. The test data consists of 3 files; you can check their validity via their MD5 checksums.
For each Wikidata revision in the test corpus, your software shall output a vandalism score in the range [0,1]. The output shall be a CSV file conforming to RFC 4180 and consist of two columns: the first column denotes Wikidata's revision id as an integer, and the second column denotes the vandalism score as a float. Here are a few example rows:
Revision Id | Vandalism Score |
---|---|
123 | 0.95 |
124 | 0.30 |
125 | 12.e-5 |
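Any CSV library that emits RFC 4180 line endings can produce this output. A minimal sketch with Python's `csv` module, using illustrative revision ids and scores (whether a header row is expected is not specified here, so this sketch omits it):

```python
import csv
import io

# Illustrative scores keyed by revision id (not real predictions).
scores = {123: 0.95, 124: 0.30, 125: 12e-5}

# csv.writer emits CRLF line endings by default, as RFC 4180 requires.
buf = io.StringIO()
writer = csv.writer(buf)
for revision_id, score in sorted(scores.items()):
    writer.writerow([revision_id, f"{score:.5f}"])
```

When writing to an actual file instead of a buffer, open it with `newline=""` so that the CRLF line endings are not translated by the platform.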
For determining the winner, we use ROC-AUC as the primary evaluation measure.
For informational purposes, we might compute further evaluation measures such as PR-AUC and the runtime of the software.
Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.
During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
After the competition, the test corpus will be made available, including the ground truth data. This way, you have everything necessary to evaluate your approach on your own while remaining comparable to those who took part in the competition.
We ask you to prepare your software so that it can be executed via a command line call with the following parameters.
> mySoftware -d HOST_NAME:PORT -a AUTHENTICATION_TOKEN
The host name, port and authentication token are needed for connecting to the server providing the evaluation data (see below).
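A sketch of how such a command line could be parsed, assuming the HOST_NAME:PORT value is split on its last colon (function and option names beyond `-d` and `-a` are our own):

```python
import argparse

def parse_args(argv=None):
    # Mirrors the documented call: mySoftware -d HOST_NAME:PORT -a TOKEN
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", dest="server", required=True,
                        help="evaluation data server as HOST_NAME:PORT")
    parser.add_argument("-a", dest="token", required=True,
                        help="authentication token")
    args = parser.parse_args(argv)
    # Split on the last colon so host names containing colons still work.
    host, port = args.server.rsplit(":", 1)
    return host, int(port), args.token
```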
You can choose freely among the available programming languages and between the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via SSH and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:
Once deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the WSDM Cup 2017. We agree not to share your software with a third party or use it for other purposes than the WSDM Cup 2017.
Your software will receive the uncompressed Wikidata revisions and the uncompressed metadata via a TCP connection, and must send its results back via the same TCP connection. Conceptually, your program works with three byte streams.
All three byte streams are sent over a single TCP connection. The simple protocol is as follows:
The result must be formatted as an RFC 4180 CSV file containing the two columns REVISION_ID and VANDALISM_SCORE. You will only receive new revisions while you keep reporting vandalism scores. More precisely, to enable fast and concurrent processing of data, we introduce a backpressure window of k revisions, i.e., you will receive revision n + k as soon as you have reported your result for revision n (the exact constant k is still to be determined, but you can expect it to be around 16 revisions).
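The exact framing of the three byte streams is defined by the demo client on the WSDM Cup GitHub page; purely for illustration, a client loop under the simplifying assumption of line-oriented CSV on a single socket might look as follows (the authentication handshake shown here is likewise an assumption):

```python
import socket

def serve_scores(sock, score_fn):
    """Read revision lines from a connected socket and immediately write
    one score line per revision back on the same socket. Reporting each
    score promptly keeps the backpressure window of k revisions full."""
    reader = sock.makefile("r", encoding="utf-8", newline="")
    writer = sock.makefile("w", encoding="utf-8", newline="")
    for line in reader:
        if not line.strip():
            continue
        revision_id = line.split(",", 1)[0]
        writer.write(f"{revision_id},{score_fn(line):.5f}\r\n")
        writer.flush()

def run_client(host, port, token, score_fn):
    # Connect, authenticate, then stream scores back over the same
    # connection, as required by the task setup.
    with socket.create_connection((host, port)) as sock:
        sock.sendall((token + "\r\n").encode("utf-8"))
        serve_scores(sock, score_fn)
```

In practice you should follow the framing implemented by the provided demo client rather than this sketch.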
You can find the data server as well as a demo client on the WSDM Cup GitHub page.
For those wondering how to get started, we recommend the following steps:
The following table lists the performances achieved by the participating teams:
ROC-AUC | PR-AUC | Accuracy | Precision | Recall | F1 | Runtime | Team |
---|---|---|---|---|---|---|---|
0.94702 | 0.45757 | 0.99909 | 0.68197 | 0.26370 | 0.38033 | 17:11:16 | Buffaloberry Rafael Crescenzi, Pablo Albani, Diego Tauziet, Andrés Sebastián D'Ambrosio, Adriana Baravalle, Marcelo Fernandez, Federico Alejandro Garcia Calabria Austral University, Argentina |
0.93708 | 0.35230 | 0.99900 | 0.67528 | 0.09943 | 0.17334 | 02:47:50 | Conkerberry Alexey Grigorev Searchmetrics, Germany |
0.91976 | 0.33738 | 0.92850 | 0.01125 | 0.76682 | 0.02218 | 104:47:30 | Loganberry Qi Zhu, Bingjie Jiang, Liyuan Liu, Jiaming Shen, Ziwei Ji, Hong Wei Ng, Jinwen Xu, Huan Gui University of Illinois at Urbana-Champaign, United States |
0.90487 | 0.16181 | 0.98793 | 0.06104 | 0.72444 | 0.11259 | 26:37:29 | Honeyberry Nishi Kentaro, Iwasawa Hiroki, Makabe Takuya, Murakami Naoya, Sakurada Ryota, Sasaki Mei, Yaku Shinya, Yamazaki Tomoya Yahoo Japan Corporation, Japan |
0.89403 | 0.17433 | 0.99501 | 0.10298 | 0.48275 | 0.16975 | 189:16:03 | Riberry Tuo Yu, Yuhang Wang, Yiran Zhao, Xin Ma, Xiaoxiao Wang, Yiwen Xu, Huajie Shao, Dipannita Dey, Honglei Zhuang, Huan Gui, Fangbo Tao University of Illinois at Urbana-Champaign, United States |