2022 TREC NeuCLIR Track

Cross-language Information Retrieval (CLIR) has been studied at TREC and subsequent evaluations for more than twenty years. Prior to the application of deep learning, strong statistical approaches were developed that work well across many languages. As with most other language technologies though, neural computing has led to significant performance improvements in information retrieval. CLIR has just begun to incorporate neural advances.

The TREC 2022 NeuCLIR track presents a cross-language information retrieval challenge. NeuCLIR topics are written in English. NeuCLIR has three target language collections in Chinese, Persian, and Russian. Topics are written in the traditional TREC format: a short title and a sentence-length description. Systems are to return a ranked list of documents for each topic. Results will be pooled, and systems will be evaluated on a range of metrics.

Tasks

Ad Hoc CLIR Task

The main task in the NeuCLIR track is ad hoc cross-language information retrieval. Systems will receive a document collection in Chinese, Persian, or Russian, and a set of English topics. For each topic, the system will return a ranked list of 1000 documents drawn from the document collection, ordered by likelihood of relevance to the topic.

Reranking CLIR Task

The reranking task is an extension of the ad hoc task. In addition to a document collection and a set of topics, systems also receive as input for each topic a ranked list of 1000 documents drawn from the document collection. Each ranked list is the output of a retrieval system operating on the ad hoc task. Reranking task results must be drawn only from the documents that appear in these lists. In all other ways, the reranking task is identical to the ad hoc task. We will release runs for reranking for the HC4 training topics.

Monolingual Retrieval Task

While monolingual retrieval is not a focus of the NeuCLIR track, monolingual runs can improve assessment pools, and serve as good points of reference for cross-language runs. The monolingual retrieval task is identical to the ad hoc task, but uses topic files that are human translations of the English topics into the target language in a way that would be expressed by speakers of the language.

Data Formats

Documents

Documents are distributed in JSONL format, with one document in JSON format on each line. The fields present for each document are:

id (string): The document ID for the document
text (string): Complete text of the document
date (string): YYYY-MM-DD or empty string
Lang (string): ISO 639-3

Participating teams can get a copy of the processed corpora directly from NIST. Alternatively, anybody can replicate the corpora by downloading them from the Common Crawl using provided scripts (this takes longer than downloading from NIST).

Ranked Results (Submissions and Reranking Inputs)

NeuCLIR will use the standard TREC ad hoc submission format for submissions and for ranked lists that serve as input to the reranking task. Each set of ranked results for a set of topics appears in a single file. Each line of this file contains six whitespace-separated entries:

Topic (query) number
The fixed string Q0
Document ID
Rank
Score (integer or float)
Run ID

Field 1 is the topic ID taken from the topics file. Entries for each topic must be contiguous in a ranked results file. Field 3 is the document ID, taken from the document collection. The scores in Field 5 must appear in non-increasing order within a given topic. In reranking input files, these values will be non-increasing, but will otherwise not be meaningful. Field 6 is the run ID, generated by the submitter (the first characters of the run ID must be the name of the submitting team). Fields 2 and 4 are ignored, although they must be present. Here is a portion of a sample ranked results file:

1 Q0 pid1    1 2.73 team1-run1
1 Q0 pid2    2 2.71 team1-run1
1 Q0 pid3    3 2.61 team1-run1
1 Q0 pid4    4 2.05 team1-run1
1 Q0 pid5    5 1.89 team1-run1

Up to 1,000 results per query will be accepted. Runs containing queries with more than 1,000 results will be rejected.

QRELS

Relevance judgments are in standard TREC QRELS format. Each line of the QRELS file will contain four whitespace-separated entries:

A topic ID
The string 0 (zero)
A document ID
A relevance judgment drawn from the following set:

3: document is fully relevant to the topic (i.e., it contains facts that would be included in lead paragraph of a report on the topic)
1: document is somewhat relevant to topic (i.e., it contains facts that would be included elsewhere in a report on the topic)
0: document is not relevant to topic (i.e., it does not contain information that would be included in a report written about the topic)

See the assessment section below for further explanation of these relevance levels. Qrels files will be distributed with development data but not evaluation data.

Data

Development Data

Development data, called HC4, include document sets of about 5 million Russian documents and ½ million each of Chinese and Persian documents. There are 60 development topics each in Chinese and Persian, and 54 development topics in Russian. Many of the HC4 documents are in NeuCLIR1. You will be able to score the development topics against the NeuCLIR1 collection if you intersect the two document sets, throw away retrieved documents not in HC4 and filter the qrels for documents in NeuCLIR1. HC4 is available from the ir_datasets package under hc4.

Evaluation Data

NEW: Details about our document collection, NEUCLIR1, are available on this page.

The document collections will be drawn from the Common Crawl News Collection, spanning a five year time window from August 2016 to July 2021. Very short and very long documents were filtered. Russian was randomly sampled to have 5 million documents pre-deduplication. Automatic deduplication removes duplicate documents. There are 4 ½ million Russian, 3 million Chinese, and 2 million Persian documents. Evaluation topics will be released in June 2022.

The download script for the collection is currently available in the NeuCLIR/download-collection repository. A gzipped version of the collection is available here (password will be available on the TREC website).

A list of multilingual topics is available here.

Additional Resources

Multiple languages
- mMARCO: Translations, using OPUS-MT and Google Translate, of MS MARCO into many languages (but not Persian):
  - ir_datasets: mmarco
  - On HuggingFace: unicamp-dl/mmarco
- neuMARCO: Translations of MS MARCO via a sockeye translation system
  - Download link neuMSMARCO.tar.gz
- WikiCLIR: Multilingual CLIR collection, including Chinese and Russian
  - ir_datasets: wikiclir
- CLIRMatrix: Another multi-lingual CLIR collection, including Chinese, Russian, and Persian.
  - ir_datasets: clirmatrix
- Example code
Chinese
- mMARCO Chinese v1 (OPUS MT)
  - ir_datasets: mmarco/zh
- mMARCO Chinese v2 (Google Translate)
  - ir_datasets: mmarco/v2/zh
- neuMARCO Chinese
  - Download link: neuMSMARCO.tar.gz
Persian
- neuMARCO Persian
  - Download link: neuMSMARCO.tar.gz
Russian
- mMARCO Russian v1 (Opus MT)
  - ir_datasets: mmarco/ru
- mMARCO Russian v2 (Google Translate)
  - ir_datasets: mmarco/v2/ru
- neuMARCO Russian
  - Download link: neuMSMARCO.tar.gz
- Mr. TyDi: Monolingual retrieval in several languages, including Russian
  - ir_datasets: mr-tydi

Software

Patapsco

Patapsco is a CLIR framework that makes it easy to get started with CLIR. An overview of Patapsco’s capabilities can be found in this paper, and the code is available from the hltcoe/patapsco repository. A Jupyter notebook that steps through basic Patapsco usage is available here.

Chinese Character Mapping

The Chinese NeuCLIR collection includes both traditional and simplified characters. For those who would like to deal with only one Chinese character set, a script to convert traditional to simplified characters is available in the NeuCLIR/download-collection repository. Users should bear in mind that the transliteration from traditional to simplified characters is imperfect, and there is no guarantee that the resulting text accurately captures the meaning of the text in the original traditional characters.

Validator

A validator for track submissions will be posted to the Active Participants area on trec.nist.gov.

ir-measures

Submissions will be evaluated using ir-measures. The software uses the official trec_eval implementation for measures that trec_eval (nDCG, MAP, etc.), and delegates computation of measures unsupported by trec_eval to alternative implementations (e.g., Rank Biased Precision uses cwl_eval). Evaluation will be conducted using the following command:

ir_measures qrels_file run_file \
    'nDCG@20 MAP RBP(rel=1) R@100 R@1000'

Submission

Runs will be submitted through the NIST submission system at trec.nist.gov/act_part/act_part.html. Runs that do not pass validation will be rejected outright. Submitted runs will be asked to specify (a) single-stage or multi-stage (i.e., ranking or reranking); (b) neural, statistical, or a combination; and (c) manual or automatic.

Automatic Runs

Each run submission must indicate whether the run is manual or automatic. An automatic run is any run that receives no human intervention once the system is started and provided with the task inputs. We expect most NeuCLIR runs to be automatic runs. Note that using the provided human and machine translations without further human intervention count as automatic runs. Runs that use the provided human translations (i.e., the monolingual task) will be reported separately from cross-language runs.

Manual Runs

Results on manual runs will be specifically identified when results are reported. A manual run is any run in which changes are made to the queries, the system, or the system’s results after the topics have been seen. This includes, for example, manually creating queries from the topic description, or based on manual examination of retrieval results. or implementing new automated processing capabilities such as stop structure removal that are created after the topics have been seen. Simple bug fixes that address only format handling do not result in manual runs, but the changes should be described.

Assessment & Scoring

Relevance assessments for each topic will be made by a single person, who would optimally be the person who created the topic.

Scoring will be performed by ir-measures. The main reported measure will be:

Normalized Discounted Cumulative Gain at 20 (nDCG@20). Weights for the three levels of relevance are:
- Not relevant: 0
- Somewhat relevant: 1
- Highly relevant: 3
Additional evaluation measures include MAP, RBP, Recall@100, and Recall@1000.

We also solicit self-reported mean response time per query and the total number of model parameters for each run. This will allow us to analyze the efficiency of various approaches. We recognize that neither of these measures allows for a perfectly fair comparison (e.g., MRT is hardware-dependent and the number of parameters in a model does not correlate directly with its efficiency), so these measurements will only be grouped into coarse-grained categories. Reporting these values is optional, and good-faith approximations are permitted.

Carbon-Neutrality

Our aim for NeuCLIR is to be a carbon-neutral track. Therefore, we strongly encourage all participants to track carbon emissions associated with their submission, and, if they are able to, buy carbon offsets to minimize the impact of their submissions. For details, please visit this page. Participants will be asked on submission whether they tracked their carbon emissions and whether they offset their impact. However, this will not affect the evaluation of individual runs; the information will be anonymized and will only be reported publicly in aggregate (e.g., “85% of teams tracked their carbon”).

Registration

Sign up with NIST, using the link at the bottom of the Call for Participation. That’s when you choose a team name. If you don’t do this, your team name won’t be in the submission system, and you won’t be able to submit runs. This also gets you one email address on the NIST mailing list, which is where all the NIST coordination information will be sent.
Sign up for our track’s Google group. This is where we discuss things like the track guidelines.

Timeline

November 2021: NeuCLIR track is first discussed at TREC 2021; feedback from potential participants is welcome.
December 2021: Release preliminary track guidelines. Complete download, cleaning and formatting of new document collection.
January 2022: Initial training data (HC4) and Patapsco infrastructure released
March 2022: Release new document collection.
May 2022: Topic development complete.
June 2022: Release test topics and baseline runs for re-ranking.
July 2022: Submissions due to NIST
October 2022: Distribute results to participants.
November 2022: TREC 2022.

Organizers

Dawn Lawrie, Johns Hopkins University, HLTCOE
Sean MacAvaney, University of Glasgow
James Mayfield, Johns Hopkins University, HLTCOE
Paul McNamee, Johns Hopkins University, HLTCOE
Douglas W. Oard, University of Maryland
Luca Soldaini, Allen Institute for AI
Eugene Yang, Johns Hopkins University, HLTCOE

Contact

We encourage all interested researchers to join our mailing list for the latest announcement and news.
For any questions, please reach out to the organizers at neuclir-organizers@googlegroups.com.
Follow us on Twitter at to connect with other NeuCLIR participants.