Vandalism Detection
Sponsor: Adobe
Supporter: Wikimedia Germany

Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation, which can be edited by anyone. Its knowledge is increasingly used within Wikipedia as well as in all kinds of information systems, which imposes high demands on its integrity. Nevertheless, Wikidata is frequently vandalized, exposing all of its users to the risk of spreading vandalized and falsified information.

Task

Given a Wikidata revision, compute a vandalism score denoting the likelihood of this revision being vandalism (or similarly damaging).

Awards

The three best-performing approaches submitted by eligible participants as per the performance measures used for this task will receive the following awards, kindly sponsored by Adobe Systems, Inc.:

  1. $1500 for the best-performing approach,
  2. $750 for the second best-performing approach, and
  3. $500 for the third best-performing approach.

Furthermore, Wikimedia Germany supports the transfer of the scientific insights gained in this task by inviting the eligible participants who submitted the best-performing approaches to visit them for a couple of days in order to work together on planning a potential integration of the approach into Wikidata.

Task Rules

The goal of the vandalism detection task is to detect vandalism in near real time, as soon as it happens. Hence, the following rules apply:

  • Use of any additional data that is newer than the provided training data is forbidden. In particular, you may not scrape any Wikimedia website, use the API, the dumps, or any related data source to obtain data that is newer than February 29, 2016.
  • You may use publicly available external data sources having to do with geographical information, demographic information, natural language processing, etc. Such data must not relate to the label of a specific revision (vandalism vs. regular).

Wikidata Vandalism Corpus 2016 Training Dataset

To develop your software, we provide you with a training corpus that consists of Wikidata revisions and whether they are considered vandalism.

The Wikidata Vandalism Corpus 2016 contains revisions of the knowledge base Wikidata. The corpus comprises manual revisions only; all revisions by official bots have been filtered out. For each revision, we indicate whether it is considered vandalism (ROLLBACK_REVERTED) or not. Unlike the Wikidata dumps, revisions are ordered chronologically by REVISION_ID (i.e., in the order they arrived at Wikidata). For training, we provide data until February 29, 2016. The evaluation will be conducted on later data.

The provided training data consists of 23 files in total. You can check their validity via their md5 checksums.
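
As a minimal sketch of such a check, the following Python snippet computes a file's MD5 digest; the file name and expected digest in the usage comment are placeholders, not actual corpus values.

    import hashlib

    def md5sum(path, chunk_size=1 << 20):
        """Compute the MD5 hex digest of a file without loading it into memory at once."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Hypothetical usage: compare against the published checksum for one archive.
    # assert md5sum("some_training_file") == "<digest from the provided checksum list>"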

Revision Data Files (21 Files)

Meta File (1 File)

Name                | Type         | Description
REVISION_ID         | Integer      | The Wikidata revision id
REVISION_SESSION_ID | Integer      | The Wikidata revision id of the first revision in this session
USER_COUNTRY_CODE   | String       | Country code for IP address (only available for unregistered users)
USER_CONTINENT_CODE | String       | Continent code for IP address (only available for unregistered users)
USER_TIME_ZONE      | String       | Time zone for IP address (only available for unregistered users)
USER_REGION_CODE    | String       | Region code for IP address (only available for unregistered users)
USER_CITY_NAME      | String       | City name for IP address (only available for unregistered users)
USER_COUNTY_NAME    | String       | County name for IP address (only available for unregistered users)
REVISION_TAGS       | List<String> | The Wikidata revision tags
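
A quick way to inspect the meta data is to load it with pandas. The sketch below assumes the meta file is named wdvc16_meta.csv (the naming pattern mentioned for the evaluation streams) and that REVISION_TAGS is serialized as a comma-separated string; both are assumptions.

    import pandas as pd

    # Hypothetical file name; the exact layout of the distributed files may differ.
    meta = pd.read_csv(
        "wdvc16_meta.csv",
        dtype={"REVISION_ID": "int64", "REVISION_SESSION_ID": "int64"},
        keep_default_na=False,  # keep empty geolocation fields as empty strings
    )

    # Assumption: REVISION_TAGS is a comma-separated string; split it into a list per revision.
    meta["REVISION_TAGS"] = meta["REVISION_TAGS"].astype(str).str.split(",")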

Truth File (1 File)

Name                  | Type    | Description
REVISION_ID           | Integer | The Wikidata revision id
ROLLBACK_REVERTED     | Boolean | Whether this revision was reverted via the rollback feature
UNDO_RESTORE_REVERTED | Boolean | Whether this revision was reverted via the undo/restore feature

The ROLLBACK_REVERTED field encodes the official ground truth for this competition. The UNDO_RESTORE_REVERTED field serves informational purposes only.

The truth file will only be available for the training dataset but not for test datasets.
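
As an illustration of how the ground truth might be read, here is a minimal pandas sketch; the file name and the boolean serialization used in the file are assumptions and may need adjusting.

    import pandas as pd

    # Hypothetical file name; the truth file ships with the training data only.
    truth = pd.read_csv("wdvc16_truth.csv")

    # ROLLBACK_REVERTED is the official ground truth. The accepted string values below
    # are an assumption about how the booleans are serialized.
    truth["label"] = (
        truth["ROLLBACK_REVERTED"].astype(str).str.upper().isin(["TRUE", "T", "1"]).astype(int)
    )

    print(truth["label"].mean())  # fraction of revisions labeled as vandalism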

The corpus can be processed, for example, with Wikidata Toolkit.

Wikidata Vandalism Corpus 2016 Validation Dataset

For validating your software, we provide you with a validation dataset that encompasses the two months succeeding the training dataset. The provided validation data consists of 3 files; you can check their validity via their md5 checksums.

Wikidata Vandalism Corpus 2016 Test Dataset

For the final evaluation of submissions, we used the two months of data succeeding the validation dataset. The data was not publicly released until after the submission deadline. The test data consists of 3 files; you can check their validity via their md5 checksums.

Output

For each Wikidata revision in the test corpus, your software shall output a vandalism score in the range [0,1]. The output shall be formatted as a CSV file according to RFC 4180 and consist of two columns: the first column denotes Wikidata's revision id as an integer, and the second column denotes the vandalism score as a float32. Here are a few example rows:

Revision Id | Vandalism Score
123         | 0.95
124         | 0.30
125         | 12.e-5

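For example, such a file could be produced with Python's csv module; whether a header row is expected is not stated here, so this sketch writes data rows only.

    import csv

    def write_scores(path, scored_revisions):
        """Write (revision_id, score) pairs as an RFC 4180 CSV file with CRLF line endings."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f, lineterminator="\r\n")
            for revision_id, score in scored_revisions:
                writer.writerow([revision_id, score])

    write_scores("scores.csv", [(123, 0.95), (124, 0.30), (125, 12e-5)])
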
Performance Measures

For determining the winner, we use ROC-AUC as the primary evaluation measure.

For informational purposes, we might compute further evaluation measures such as PR-AUC and the runtime of the software.
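
To track these measures during development, you could, for instance, use scikit-learn; note that average_precision_score is just one common way to estimate PR-AUC and may differ from how the organizers compute it.

    from sklearn.metrics import average_precision_score, roc_auc_score

    # y_true: 0/1 vandalism labels, y_score: predicted vandalism scores in [0, 1]
    y_true = [0, 0, 1, 0, 1]
    y_score = [0.10, 0.30, 0.80, 0.20, 0.60]

    print("ROC-AUC:", roc_auc_score(y_true, y_score))
    print("PR-AUC :", average_precision_score(y_true, y_score))  # one common PR-AUC estimate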

Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

After the competition, the test corpus will be made available including the ground truth data. This way, you have everything needed to evaluate your approach on your own while remaining comparable to those who took part in the competition.

Submission

We ask you to prepare your software so that it can be executed via a command line call with the following parameters.

  > mySoftware -d HOST_NAME:PORT -a AUTHENTICATION_TOKEN
  

The host name, port and authentication token are needed for connecting to the server providing the evaluation data (see below).
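
A minimal sketch of how such a command line could be parsed in Python; the variable names are illustrative only.

    import argparse

    parser = argparse.ArgumentParser(description="WSDM Cup 2017 vandalism detection client")
    parser.add_argument("-d", dest="dataserver", required=True,
                        help="evaluation data server given as HOST_NAME:PORT")
    parser.add_argument("-a", dest="token", required=True,
                        help="authentication token for the data server")
    args = parser.parse_args()

    host, port_str = args.dataserver.rsplit(":", 1)
    port = int(port_str)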

You can choose freely among the available programming languages and between the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via SSH and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

Virtual Machine User Guide »

Once deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the WSDM Cup 2017. We agree not to share your software with a third party or use it for other purposes than the WSDM Cup 2017.

Receiving Evaluation Data

Your software will receive the uncompressed Wikidata revisions and the uncompressed meta data via a TCP connection. Your program must send the results back via the same TCP connection. Conceptually, your program works with three byte streams:

  1. One stream provides uncompressed Wikidata revisions (in the same format as in the wdvc16_YYYY_MM.xml files)
  2. One stream provides uncompressed meta data (in the same format as in the wdvc16_meta.csv files)
  3. One stream receives the vandalism scores (as specified in the output format on the WSDM Cup website)

All three byte streams are sent over a single TCP connection. The simple protocol is as follows:

  1. The client software connects to the server, sends the given authentication token, and terminates the line with '\r\n'
  2. The server sends revisions and meta data in a multiplexed way to the client
    1. Number of meta bytes to be sent (encoded as int32 in network byte order)
    2. Meta bytes
    3. Number of revision bytes to be sent (encoded as int32 in network byte order)
    4. Revision bytes
  3. The server closes the output socket as soon as there is no more data to send (half close)
  4. The client closes the output socket as soon as there are no more scores to send

The result must be formatted as an RFC 4180 CSV file containing the two columns REVISION_ID and VANDALISM_SCORE. You will only receive new revisions while reporting vandalism scores. More precisely, to enable fast and concurrent processing of data, we introduce a backpressure window of k revisions, i.e., you will receive revision n + k as soon as you have reported your result for revision n (the exact constant k is still to be determined, but you can expect it to be around 16 revisions).
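
The following is a minimal synchronous client sketch in Python illustrating the protocol steps above. The score_revision function is a placeholder for your own model, and extracting the revision id from the first meta field is a naive assumption; a more complete demo client is linked below.

    import socket
    import struct

    def read_exact(sock, n):
        """Read exactly n bytes, or return None if the server has half-closed the connection."""
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                return None
            buf += chunk
        return buf

    def run_client(host, port, token, score_revision):
        """score_revision(meta_bytes, revision_bytes) -> float is a placeholder for your model."""
        with socket.create_connection((host, port)) as sock:
            sock.sendall(token.encode("utf-8") + b"\r\n")      # step 1: authenticate
            while True:
                header = read_exact(sock, 4)
                if header is None:                              # step 3: server half-closed
                    break
                meta_len, = struct.unpack("!i", header)         # int32, network byte order
                meta = read_exact(sock, meta_len)
                rev_len, = struct.unpack("!i", read_exact(sock, 4))
                revision = read_exact(sock, rev_len)
                # Naive assumption: REVISION_ID is the first field of the meta CSV line.
                revision_id = meta.decode("utf-8").split(",", 1)[0]
                score = score_revision(meta, revision)
                sock.sendall(f"{revision_id},{score}\r\n".encode("utf-8"))
            sock.shutdown(socket.SHUT_WR)                       # step 4: close our output side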

Example Programs

You can find the data server as well as a demo client on the WSDM Cup Github page.

Getting Started

For those wondering how to get started, we recommend the following steps (a minimal sketch follows the list):

  1. For training (on your own machine)
    1. Extract features from the provided training data
    2. Train a classifier on those features and store the classifier in a file
  2. For evaluation (on TIRA)
    1. Load the classifier from the file
    2. For every revision of the evaluation dataset
      1. Extract features (in the same way as during training)
      2. Compute a vandalism score with the classifier
      3. Output the vandalism score
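
The following end-to-end sketch in Python illustrates these steps with two toy features, a random forest, and hypothetical file names; it is meant to show the workflow, not a competitive approach.

    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    def extract_features(meta):
        """Toy features from the meta file only; real systems also use the revision content."""
        return pd.DataFrame({
            "is_unregistered": meta["USER_COUNTRY_CODE"].astype(str).str.len().gt(0).astype(int),
            "has_tags": meta["REVISION_TAGS"].astype(str).str.len().gt(0).astype(int),
        })

    # --- Training (on your own machine); file names are hypothetical ---
    meta = pd.read_csv("wdvc16_meta.csv", keep_default_na=False)
    truth = pd.read_csv("wdvc16_truth.csv")
    data = meta.merge(truth, on="REVISION_ID")
    X = extract_features(data)
    y = data["ROLLBACK_REVERTED"].astype(str).str.upper().isin(["TRUE", "T", "1"]).astype(int)
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X, y)
    joblib.dump(clf, "classifier.joblib")

    # --- Evaluation (on TIRA) ---
    clf = joblib.load("classifier.joblib")
    # For each incoming revision, build features exactly as above and report
    # clf.predict_proba(features)[:, 1] as the vandalism score.
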
Results

The following table lists the performances achieved by the participating teams (ROC-AUC, PR-AUC, accuracy, precision, recall, F-measure, and runtime):

ROC     | PR      | ACC     | P       | R       | F       | Runtime   | Team
0.94702 | 0.45757 | 0.99909 | 0.68197 | 0.26370 | 0.38033 | 17:11:16  | Buffaloberry
Rafael Crescenzi, Pablo Albani, Diego Tauziet, Andrés Sebastián D'Ambrosio, Adriana Baravalle, Marcelo Fernandez, Federico Alejandro Garcia Calabria
Austral University, Argentina

0.93708 | 0.35230 | 0.99900 | 0.67528 | 0.09943 | 0.17334 | 02:47:50  | Conkerberry
Alexey Grigorev
Searchmetrics, Germany

0.91976 | 0.33738 | 0.92850 | 0.01125 | 0.76682 | 0.02218 | 104:47:30 | Loganberry
Qi Zhu, Bingjie Jiang, Liyuan Liu, Jiaming Shen, Ziwei Ji, Hong Wei Ng, Jinwen Xu, Huan Gui
University of Illinois at Urbana-Champaign, United States

0.90487 | 0.16181 | 0.98793 | 0.06104 | 0.72444 | 0.11259 | 26:37:29  | Honeyberry
Nishi Kentaro, Iwasawa Hiroki, Makabe Takuya, Murakami Naoya, Sakurada Ryota, Sasaki Mei, Yaku Shinya, Yamazaki Tomoya
Yahoo Japan Corporation, Japan

0.89403 | 0.17433 | 0.99501 | 0.10298 | 0.48275 | 0.16975 | 189:16:03 | Riberry
Tuo Yu, Yuhang Wang, Yiran Zhao, Xin Ma, Xiaoxiao Wang, Yiwen Xu, Huajie Shao, Dipannita Dey, Honglei Zhuang, Huan Gui, Fangbo Tao
University of Illinois at Urbana-Champaign, United States

Task Chairs

Stefan Heindorf
Paderborn University

Martin Potthast
Bauhaus-Universität Weimar

Task Committee

Gregor Engels
Paderborn University

Benno Stein
Bauhaus-Universität Weimar