NeuCLIR


2023 TREC NeuCLIR Track

Version 1.1; 9 May 2023

Access to the 2023 NeuCLIR datasets (documents and queries) requires registration for TREC, which is free and can be started from this webpage.
Topics for this year have been released! You can download them directly from the NIST website. We released 76 topics, all of which are intended to be multilingual. For both CLIR and multilingual submissions, please run and submit results for all topics.

Cross-language Information Retrieval (CLIR) has been studied at TREC and subsequent evaluations for more than twenty years. Prior to the application of deep learning, strong statistical approaches were developed that work well across many languages. As with most other language technologies, though, neural computing has led to significant performance improvements in information retrieval. CLIR has only just begun to incorporate these neural advances.

The TREC 2023 NeuCLIR track presents a cross-language information retrieval challenge. NeuCLIR topics are written in English. NeuCLIR has three target language collections in Chinese, Persian, and Russian. Topics are written in the traditional TREC format: a short title and a sentence-length description. Systems are to return a ranked list of documents for each topic. Results will be pooled, and systems will be evaluated on a range of metrics.


Tasks

The following table lists the five NeuCLIR 2023 tasks, along with the main variants of each:

Collection | Document Language(s) | Query Language | Topic Fields | Full Retrieval (FR) or Reranking (RR)
Single-Language News | fas | eng, fas, other | title, desc, both, other | FR, RR
Single-Language News | rus | eng, rus, other | title, desc, both, other | FR, RR
Single-Language News | zho | eng, zho, other | title, desc, both, other | FR, RR
Multilingual News | fas+rus+zho | eng, fas, rus, zho, other | title, desc, both, other | FR, RR
Single-Language Technical | zho (CSL) | eng, zho, other | title, desc, both, other | FR, RR

Single-Language News Retrieval (CLIR and Monolingual) Tasks

The main task in the NeuCLIR track is ad hoc cross-language news retrieval. Systems will receive a document collection in Chinese, Persian, or Russian, and a set of topics in one of English, Chinese, Persian or Russian. For each topic, the system will return a ranked list of 1000 documents drawn from the document collection, ordered by likelihood of relevance to the topic. To facilitate fair comparisons across reranking approaches, the organizers will provide a strong initial ranking of documents.

Topics will be available in Chinese, Persian and Russian. In addition to CLIR runs with English queries (the main task), we invite submissions from systems in which the query language is the same as the document language (i.e., monolingual runs), and submissions from systems in which the query language is neither English nor the document language (non-English CLIR runs). While queries are provided in multiple languages, we encourage the use of a single query language throughout a given retrieval pipeline; this is especially encouraged in reranking, where queries may be ingested by multiple systems.

Multilingual News Retrieval (MLIR) Task 💥 New in 2023! 💥

This task is identical to the Ad Hoc CLIR task, with the exception that for each query, systems must search all three document collections and produce a single ranked list. That is, systems should treat the entirety of the NeuCLIR-1 document collections across all three languages as a single corpus. The topics for the multilingual retrieval task will be identical to those of the ad hoc CLIR task. Participants should be aware that there is no guarantee that the set of relevant documents for a query will include documents from all three languages. As with the CLIR tasks, the organizers will provide a strong initial ranking to allow fair comparisons across reranking approaches for this task.

Technical Documents (CLIR and Monolingual) Pilot Task 💥 New in 2023! 💥

Deadline: August 14, 2023

Domain-specific texts can exhibit writing styles and vocabulary that are difficult for machine translation or multilingual embeddings to handle. The technical document pilot task will examine the feasibility of an ad hoc CLIR task over documents and topics drawn from a variety of technical domains. The task will ask systems to retrieve Chinese technical documents (specifically, abstracts of academic papers and theses) using English queries. The task is identical to the ad hoc news retrieval CLIR task except for the domain and the size of the document collection. In addition to English topics, topics translated into Chinese will be available; submission of monolingual runs using these translations is invited. We also ask that all participants in the Chinese news CLIR task submit one or more baseline technical document runs using their news retrieval system. As with the CLIR tasks, the organizers will provide a strong initial ranking to allow fair comparisons across reranking approaches for this task. Because this is a pilot task, the number of topics will be limited. Topics will be released after the submission deadline of the other tasks.

Back to top


Formats

Documents

Documents are distributed in JSONL format, with one document in JSON format on each line. The fields present for each document are:

The documents can be downloaded from Huggingface Datasets and can be used by various toolkits (including Patapsco) via the ir_datasets integration: neuclir.
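For a quick look at the documents programmatically, here is a minimal sketch in Python; the JSONL path and the ir_datasets identifier ("neuclir/1/zh") are assumptions on our part, so check the neuclir entries in ir_datasets for the exact names:

import json

import ir_datasets  # pip install ir-datasets

# Option 1: plain JSONL, one JSON document per line (path is a placeholder).
with open("neuclir1-zho-docs.jsonl", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        print(sorted(doc.keys()))  # list the fields present for this document
        break

# Option 2: through the ir_datasets integration (dataset id assumed).
dataset = ir_datasets.load("neuclir/1/zh")
for doc in dataset.docs_iter():
    print(doc.doc_id)
    break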

Participating teams can get a copy of the processed corpora directly from NIST.

The technical documents are an adapted version of the CSL dataset. They can be downloaded from Huggingface Datasets and can be used by various toolkits (including Patapsco).

Ranked Results (Submissions and Reranking Inputs)

NeuCLIR will use the standard TREC ad hoc submission format for submissions and for the baseline ranked lists that serve as input to the reranking task. Each set of ranked results for a set of topics appears in a single file. Each line of this file contains six whitespace-separated entries:

  1. Topic (query) number
  2. The fixed string “Q0”
  3. Document ID
  4. Rank
  5. Score (integer or float)
  6. Run ID

Field 1 is the topic ID taken from the topics file. Entries for each topic must be contiguous in a ranked results file. Field 3 is the document ID, taken from the document collection. The scores in Field 5 must appear in non-increasing order within a given topic. In the reranking input files, these values will be non-increasing but will otherwise not be meaningful. Note that the TREC evaluation software orders entries with the same score lexicographically by document ID, not by the order in which they appear in the submission file. Field 6 is the run ID, generated by the submitter (see below). Fields 2 and 4 are ignored, although they must be present.

Here is a portion of a sample ranked results file:

1 Q0 pid1 1 2.73 zho-team1-ACLN_run1
1 Q0 pid2 2 2.71 zho-team1-ACLN_run1
1 Q0 pid3 3 2.61 zho-team1-ACLN_run1
1 Q0 pid4 4 2.05 zho-team1-ACLN_run1
1 Q0 pid5 5 1.89 zho-team1-ACLN_run1

Up to 1,000 results per query will be accepted. Runs containing queries with more than 1,000 results will be truncated.
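As a minimal sketch (the run ID, topic IDs, and document IDs below are placeholders), the following Python snippet writes results in this six-column format, sorts each topic's results by descending score, and truncates to 1,000 entries per topic:

def write_run(results, run_id, path, max_per_topic=1000):
    # results: dict mapping topic_id -> list of (doc_id, score) pairs
    with open(path, "w", encoding="utf-8") as out:
        for topic_id, scored in results.items():
            # Scores must be non-increasing within a topic; anything beyond
            # 1,000 results per topic would be truncated anyway.
            ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
            for rank, (doc_id, score) in enumerate(ranked[:max_per_topic], start=1):
                out.write(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_id}\n")

write_run({"1": [("pid1", 2.73), ("pid2", 2.71)]}, "zho-team1-ACLN_run1", "run.txt")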

Run Names

Run names must begin with a string identifying the source document collection or task, followed by a dash, selected from the following options:

The second field of each run name must be your registered team name, again followed by a dash.

To better categorize your runs, we recommend adding a third field consisting of all of the following characters that apply to the run, followed by an underscore:

So for example, mlir-TEAM3-DEAT_MyAmazingRun represents an automatic multilingual run over machine-translated documents using English queries with dense retrieval, tech-TEAM3-TN_MyNiftyRun represents a technical document task run using both native and machine-translated documents, fas-TEAM3-2N_MyUglyRun represents a reranking run over native Persian language documents, and rus-TEAM3-MyOutrageousRun represents a run with no further information. The informational third field is recommended but optional; if present, it will be used to ensure the run is properly categorized in the track results.

Run names may follow the fields listed above with any desired text.
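As a quick sanity check, the sketch below matches run names against this convention. The set of collection/task prefixes (fas, rus, zho, mlir, tech) is inferred from the task table and the examples above, and the optional third field is treated as a free-form alphanumeric string, so this is an illustrative approximation rather than the official validator:

import re

RUN_NAME = re.compile(
    r"^(fas|rus|zho|mlir|tech)-"   # collection/task prefix and dash
    r"[A-Za-z0-9]+-"               # registered team name and dash
    r"([A-Za-z0-9]+_)?"            # optional categorization field + underscore
    r".+$"                         # free-form run description
)

for name in ["mlir-TEAM3-DEAT_MyAmazingRun", "rus-TEAM3-MyOutrageousRun", "badname"]:
    print(name, bool(RUN_NAME.match(name)))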

QRELS

Relevance judgments are in standard TREC QRELS format. Each line of the QRELS file will contain four whitespace-separated entries:

  1. A topic ID
  2. The string 0 (zero)
  3. A document ID
  4. A relevance judgment drawn from the following set:
    • 3: document is fully relevant to the topic (i.e., it contains facts that would be included in the lead paragraph of a report on the topic)
    • 2: not used (no documents receive a judgment of 2)
    • 1: document is somewhat relevant to the topic (i.e., it contains facts that would be included elsewhere in a report on the topic)
    • 0: document is not relevant to topic (i.e., it does not contain information that would be included in a report written about the topic)

See the Assessment section below for further explanation of these relevance levels. Qrels files will be distributed with development data but not evaluation data.
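A minimal sketch for reading a qrels file of this form into Python (the file name is a placeholder):

from collections import defaultdict

qrels = defaultdict(dict)  # topic_id -> {doc_id: judgment}
with open("dev.qrels", encoding="utf-8") as f:
    for line in f:
        topic_id, _zero, doc_id, judgment = line.split()
        qrels[topic_id][doc_id] = int(judgment)

# For recall-style measures, judgments of 1 or 3 count as relevant
# (no document receives a judgment of 2).
relevant = {t: {d for d, j in docs.items() if j >= 1} for t, docs in qrels.items()}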

Back to top


Data

Development Data

Two sets of development data are available. The first, called HC4, includes document sets of about 5 million Russian documents and ½ million each of Chinese and Persian documents. There are 60 development topics each in Chinese and Persian, and 54 development topics in Russian. Many of the HC4 documents are also in NeuCLIR-1. You can score the development topics against the NeuCLIR-1 collection by intersecting the two document sets: discard retrieved documents that are not in HC4, and filter the qrels to documents that appear in NeuCLIR-1. This filtered version of the corpus, along with the development topics and qrels, is available automatically through the hc4-filtered datasets in ir_datasets.
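A minimal sketch of loading this filtered development data; the exact dataset identifier ("neuclir/1/zh/hc4-filtered" for Chinese here) and field names are assumptions, so check the neuclir entries in ir_datasets for the authoritative names per language:

import ir_datasets

dataset = ir_datasets.load("neuclir/1/zh/hc4-filtered")

for query in dataset.queries_iter():   # development topics
    print(query.query_id)
    break
for qrel in dataset.qrels_iter():      # development judgments
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break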

The second development set comprises the NeuCLIR 2022 evaluation set. While the HC4 development data can be useful, the NeuCLIR 2022 evaluation set is a better match to the NeuCLIR 2023 evaluation: the document collection is identical, and the topics were developed in the same way. Topics and qrels for this development set are also available through NIST and in ir_datasets. Forty-eight of the fifty topics have relevant documents in more than one language; a list of these topics may be found on the track's webpage.

Evaluation Data

NeuCLIR 2023 will use the NeuCLIR-1 document collection, the same collection used for NeuCLIR 2022. The document collections are drawn from the Common Crawl News Collection and span a five-year window from August 2016 to July 2021. Very short and very long documents were removed, and automatic deduplication removed duplicate documents from all three collections. The Russian collection was randomly sampled to roughly 5 million documents after deduplication. There are 4½ million Russian, 3 million Chinese, and 2 million Persian documents. Information on the document collection is provided above.

Evaluation topics will be released June 5, 2023.

Additional Resources

General

Chinese

Persian

Russian

Back to top


Software

Patapsco

Patapsco is a CLIR framework that makes it easy to get started with CLIR. An overview of Patapsco’s capabilities can be found in this paper, and the code is available from this git repository. A Jupyter notebook that steps through basic Patapsco usage is available here.

Chinese Character Mapping

The Chinese NeuCLIR collection includes both traditional and simplified characters. For those who would like to work with only one Chinese character set, a script to convert traditional to simplified characters is available here. Users should bear in mind that the conversion from traditional to simplified characters is imperfect, and there is no guarantee that the resulting text accurately captures the meaning of the text in the original traditional characters.
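If you prefer an off-the-shelf library instead of the linked script, the sketch below uses OpenCC for the same traditional-to-simplified conversion; this is an alternative we suggest, not the track's own script, and it carries the same caveat about imperfect mappings:

from opencc import OpenCC  # pip install opencc (or opencc-python-reimplemented)

t2s = OpenCC("t2s")  # traditional-to-simplified conversion profile
print(t2s.convert("資訊檢索"))  # character-level conversion; prints 资讯检索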

Validator

A validator for track submissions will be posted to the Active Participants area on trec.nist.gov.

ir-measures

Submissions will be evaluated using ir-measures, which is available here. The software uses the official trec_eval implementation for measures that trec_eval supports (nDCG, MAP, etc.), and delegates computation of measures unsupported by trec_eval to alternative implementations (e.g., Rank Biased Precision uses cwl_eval). Evaluation will be conducted using the following command:

ir_measures qrels_file run_file 'nDCG@20 MAP RBP(rel=1) R@100 R@1000'
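The same evaluation can be run through the ir_measures Python API; a minimal sketch with placeholder file names (AP is the package's name for MAP):

import ir_measures
from ir_measures import nDCG, AP, RBP, R

qrels = ir_measures.read_trec_qrels("qrels_file")
run = ir_measures.read_trec_run("run_file")

measures = [nDCG @ 20, AP, RBP(rel=1), R @ 100, R @ 1000]
print(ir_measures.calc_aggregate(measures, qrels, run))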

Back to top


Submission

Runs will be submitted through the NIST submission system. Runs that do not pass validation will be rejected outright. At submission time, you will be asked to specify (a) task; (b) document language(s); (c) query language; (d) topic fields; (e) full retrieval or reranking; (f) neural, statistical, or a combination; and (g) automatic (no human intervention) or manual (human intervention at any point).

You may submit an unlimited number of runs, ordered according to which runs you most want to be part of the evaluation pools. The number of runs from each participating group included in the pools, and the depth to which those runs will be examined, will depend on the number and variety of submissions received. At least three submissions from each group will be included in the pools.

Automatic Runs

Each run submission must indicate whether the run is manual or automatic. An automatic run is any run that receives no human intervention (translating the queries yourself, for example, counts as human intervention). We expect most NeuCLIR runs to be automatic. Note that using the provided human and machine translations without further human intervention counts as automatic. Runs that use the provided human translations at any stage of retrieval will be considered monolingual runs and will be reported separately from cross-language runs.

Manual Runs

Results on manual runs will be specifically identified when results are reported. A manual run is any run for which changes are made to the queries, the system, or the system's results after the topics have been seen by a person. This includes, for example, manually creating queries from the topic description or from manual examination of retrieval results, or implementing new automated processing capabilities (such as stop-structure removal) that are created after the topics have been seen. Simple bug fixes that address only format handling do not result in manual runs, but such changes should be described.

Assessment & Scoring

Relevance assessments for each topic will be made by a single person per language. MLIR submissions will be separated by language when adding documents to the pools.

Scoring will be performed by ir-measures, as described above. The main reported measure will be Normalized Discounted Cumulative Gain at 20 (nDCG@20). Weights for the three levels of relevance are:

Note that there is no relevance score of 2.

Additional evaluation measures include MAP, RBP, Recall@100, and Recall@1000.

Multilingual runs will be scored using the same measures by merging the qrels files for the three languages into a single multilingual qrels file. Multilingual runs will also be assessed for language fairness using α-nDCG, with each language treated as an aspect.
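For reference, merging the per-language qrels into a single multilingual qrels file is a simple concatenation; a minimal sketch with placeholder file names:

# Concatenate the three per-language qrels files into one multilingual file.
with open("mlir.qrels", "w", encoding="utf-8") as out:
    for path in ["fas.qrels", "rus.qrels", "zho.qrels"]:
        with open(path, encoding="utf-8") as f:
            for line in f:
                out.write(line.rstrip("\n") + "\n")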

We solicit self-reported Mean Response Time (MRT) per query and the total number of model parameters for each run. This will allow us to analyze the efficiency of various approaches. We recognize that neither of these measures allows for a perfectly fair comparison (e.g., MRT is hardware-dependent, and the number of parameters in a model does not correlate directly with its efficiency), so these measurements will only be grouped into coarse-grained categories. Reporting these values is optional, and good-faith approximations are permitted.

Back to top


Carbon Neutrality

Our aim for NeuCLIR is to be a carbon-neutral track. Therefore, we strongly encourage all participants to track carbon emissions associated with their submission, and, if they are able to, buy carbon offsets to minimize the impact of their submissions. For details, please visit this page. Participants will be asked on submission whether they tracked their carbon emissions and whether they offset their impact. However, this will not affect the evaluation of individual runs; the information will be anonymized and will only be reported publicly in aggregate (e.g., “85% of teams tracked their carbon”).

Back to top


Registration

Important Dates

Back to top

TREC 2023 Agenda

The NeuCLIR session will be held on Wednesday, November 15, 2023 in person at NIST in Gaithersburg, MD, as well as virtually.

The agenda is as follows (all times are in Eastern Standard Time/UTC -5):

Time Event
1:00 PM - 1:30 PM CLIR/MLIR Task Introduction (Dawn & Eugene)
1:30 PM - 2:30 PM Team Presentations (Various)
2:30 PM - 2:45 PM Break
2:45 PM - 2:55 PM Tech Task Introduction (Dawn & Eugene)
2:55 PM - 3:00 PM Team Presentation (AI2, Luca)
3:00 PM - 4:00 PM NeuCLIR 2024 Planning Session (Luca & Jim)

The schedule for team presentations is as follows:

Time Team
1:30 PM - 1:40 PM Waterloo (Carlos Lassance)
1:40 PM - 1:50 PM UMass (Zhiqi Huang)
1:50 PM - 2:05 PM COE (Eugene Yang)
2:05 PM - 2:20 PM ISI (Scott Miller)
2:20 PM - 2:25 PM UMD (Suraj Nair)

To register for TREC (either in person or virtually), please visit the active participants area of the TREC 2023 website.

Back to top


Organizers

In alphabetical order:

Back to top


Contact

Back to top