Version 1.1; 9 May 2023
Cross-language Information Retrieval (CLIR) has been studied at TREC and subsequent evaluations for more than twenty years. Prior to the application of deep learning, strong statistical approaches were developed that work well across many languages. As with most other language technologies, though, neural computing has led to significant performance improvements in information retrieval. CLIR has only just begun to incorporate these neural advances.
The TREC 2023 NeuCLIR track presents a cross-language information retrieval challenge. NeuCLIR topics are written in English. NeuCLIR has three target language collections in Chinese, Persian, and Russian. Topics are written in the traditional TREC format: a short title and a sentence-length description. Systems are to return a ranked list of documents for each topic. Results will be pooled, and systems will be evaluated on a range of metrics.
The following table lists the five NeuCLIR 2023 tasks, along with the main variants of each:
Collection | Document Language(s) | Query Language | Topic Fields | Full Retrieval (FR) or Reranking (RR) |
---|---|---|---|---|
Single Language News | fas | eng, fas, other | title, desc, both, other | FR, RR |
Single Language News | rus | eng, rus, other | title, desc, both, other | FR, RR |
Single Language News | zho | eng, zho, other | title, desc, both, other | FR, RR |
Multilingual News | fas+rus+zho | eng, fas, rus, zho, other | title, desc, both, other | FR, RR |
Single-Language Technical | CSL | eng, zho, other | title, desc, both, other | FR, RR |
The main task in the NeuCLIR track is ad hoc cross-language news retrieval. Systems will receive a document collection in Chinese, Persian, or Russian, and a set of topics in one of English, Chinese, Persian or Russian. For each topic, the system will return a ranked list of 1000 documents drawn from the document collection, ordered by likelihood of relevance to the topic. To facilitate fair comparisons across reranking approaches, the organizers will provide a strong initial ranking of documents.
Topics will be available in Chinese, Persian and Russian. In addition to CLIR runs with English queries (the main task), we invite submissions from systems in which the query language is the same as the document language (i.e., monolingual runs), and submissions from systems in which the query language is neither English nor the document language (non-English CLIR runs). While queries are provided in multiple languages, we encourage the use of a single query language throughout a given retrieval pipeline; this is especially encouraged in reranking, where queries may be ingested by multiple systems.
This task is identical to the Ad Hoc CLIR task, with the exception that for each query, systems must search all three document collections and produce a single ranked list. That is, systems should treat the entirety of the NeuCLIR-1 document collections across all three languages as a single corpus. The topics for the multilingual retrieval task will be identical to those of the ad hoc CLIR task. Participants should be aware that there is no guarantee that the set of relevant documents for a query will include documents from all three languages. As with the CLIR tasks, the organizers will provide a strong initial ranking to allow fair comparisons across reranking approaches for this task.
Deadline: August 14, 2023
Domain-specific texts can exhibit writing styles and vocabulary that are difficult for machine translation or multilingual embeddings to handle. The technical document pilot task will examine the feasibility of an ad hoc CLIR task over documents and topics drawn from a variety of technical domains. The task will ask systems to retrieve Chinese technical documents (specifically, abstracts of academic papers and theses) using English queries. The task is identical to the ad hoc news retrieval CLIR task except for the domain and the size of the document collection. In addition to English topics, topics translated into Chinese will be available; submission of monolingual runs using these translations is invited. We also ask that all participants in the Chinese news CLIR task submit one or more baseline technical document runs using their news retrieval system. As with the CLIR tasks, the organizers will provide a strong initial ranking to allow fair comparisons across reranking approaches for this task. Because this is a pilot task, the number of topics will be limited. Topics will be released after the submission deadline of the other tasks.
Documents are distributed in JSONL format, with one document as a JSON object on each line; each object carries a document ID, title, and text, among other fields. The documents can be downloaded from Huggingface Datasets and can be used by various toolkits (including Patapsco) via the ir_datasets integration: neuclir.
Participating teams can get a copy of the processed corpora directly from NIST.
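For orientation, here is a minimal sketch of iterating over one of the news collections through the ir_datasets integration mentioned above; the dataset identifier and document field names shown are assumptions and should be checked against the ir_datasets documentation.

```python
import ir_datasets

# Load one of the NeuCLIR-1 news collections via the ir_datasets integration.
# The dataset identifier below is an assumption; see the ir_datasets "neuclir"
# documentation for the exact IDs of the Chinese, Persian, and Russian collections.
dataset = ir_datasets.load("neuclir/1/zh")

# Stream documents without loading the whole collection into memory.
for doc in dataset.docs_iter():
    # Each document exposes at least a document ID, a title, and the body text.
    print(doc.doc_id, doc.title)
    break
```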
The technical documents are an adapted version of the CSL dataset. They can be downloaded from Huggingface Datasets and can be used by various toolkits (including Patapsco).
NeuCLIR will use the standard TREC ad hoc submission format for submissions and for the baseline ranked lists that serve as input to the reranking task. Each set of ranked results for a set of topics appears in a single file. Each line of this file contains six whitespace-separated entries:
Field 1 is the topic ID, taken from the topics file. Entries for each topic must be contiguous in a ranked results file. Field 3 is the document ID, taken from the document collection. The scores in Field 5 must appear in non-increasing order within a given topic. In the reranking input files, these scores will be non-increasing but will otherwise not be meaningful. Note that the TREC evaluation software orders entries with the same score lexicographically by document ID, not by the order in which they appear in the submission file. Field 6 is the run ID, generated by the submitter (see below). Fields 2 and 4 are ignored, although they must be present.
Here is a portion of a sample ranked results file:
1 Q0 pid1 1 2.73 zho-team1-ACLN_run1
1 Q0 pid2 2 2.71 zho-team1-ACLN_run1
1 Q0 pid3 3 2.61 zho-team1-ACLN_run1
1 Q0 pid4 4 2.05 zho-team1-ACLN_run1
1 Q0 pid5 5 1.89 zho-team1-ACLN_run1
Up to 1,000 results per query will be accepted. Runs containing queries with more than 1,000 results will be truncated.
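As an illustration of the format, here is a minimal sketch that writes a ranked list in the six-column layout shown above; the topic IDs, document IDs, scores, and run name are placeholders.

```python
# Write ranked results in the six-column TREC run format described above.
# Topic IDs, document IDs, scores, and the run name are placeholders.
def write_run(results, run_id, path, max_results=1000):
    """results maps topic_id -> list of (doc_id, score), sorted by decreasing score."""
    with open(path, "w", encoding="utf-8") as out:
        for topic_id, ranked in results.items():
            for rank, (doc_id, score) in enumerate(ranked[:max_results], start=1):
                # Fields: topic ID, ignored, document ID, rank (ignored), score, run ID.
                out.write(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_id}\n")

write_run({"1": [("pid1", 2.73), ("pid2", 2.71)]}, "zho-team1-ACLN_run1", "run.txt")
```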
Run names must begin with a string identifying the source document collection or task, followed by a dash, selected from the following options:

- zho- for Chinese newswire source documents
- fas- for Persian newswire source documents
- rus- for Russian newswire source documents
- mlir- for the multilingual task (Chinese, Persian, and Russian source documents)
- tech- for Chinese technical source documents

The second field of each run name must be your registered team name, again followed by a dash.
To better categorize your runs, we recommend a third field consisting of all of the following characters that apply to the run, followed by an underscore:

- 1: Monolingual run (queries in document language)
- 2: Reranking of the official reranking source run
- A: Automatic run (no human intervention of any kind)
- M: Manual run (i.e., some form of human intervention used)
- T: Translated documents (translated using MT)
- N: Native language documents (i.e., not translated using MT)
- E: English queries
- C: Chinese queries
- P: Persian queries
- R: Russian queries
- S: Sparse retrieval
- D: Dense retrieval
- L: Learned sparse retrieval
- H: Hybrid method including a combination of dense, sparse, or learned sparse retrieval

For example, mlir-TEAM3-DEAT_MyAmazingRun represents an automatic multilingual run over machine-translated documents using English queries with dense retrieval; tech-TEAM3-TN_MyNiftyRun represents a technical document task run using both native and machine-translated documents; fas-TEAM3-2N_MyUglyRun represents a reranking run over native Persian language documents; and rus-TEAM3-MyOutrageousRun represents a run with no further information. The informational prefix is recommended but optional; if present, it will be used to ensure the run is properly categorized in the track results.
Run names may follow the fields listed above with any desired text.
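These conventions can be checked mechanically before submission; the sketch below is an informal check that only mirrors this section (it is not the official validator mentioned later).

```python
import re

# Informal check of the run naming convention described above:
#   <collection>-<team>-[<category letters>_]<free text>
# This only mirrors the conventions in this section; it is not the official
# NIST validator.
RUN_NAME = re.compile(
    r"^(zho|fas|rus|mlir|tech)-"      # collection/task prefix
    r"[A-Za-z0-9]+-"                  # registered team name
    r"(?:[12AMTNECPRSDLH]+_)?"        # optional category characters + underscore
    r"\S+$"                           # free-form run description
)

for name in ["mlir-TEAM3-DEAT_MyAmazingRun", "rus-TEAM3-MyOutrageousRun", "badname"]:
    print(name, bool(RUN_NAME.match(name)))
```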
Relevance judgments are in standard TREC QRELS format. Each line of the QRELS file contains four whitespace-separated entries: the topic ID, an unused field that is always 0 (zero), the document ID, and the relevance judgment. The relevance judgment takes one of the following values:

- 3: document is fully relevant to the topic (i.e., it contains facts that would be included in the lead paragraph of a report on the topic)
- 2: unused; there are no documents with relevance judgment 2
- 1: document is somewhat relevant to the topic (i.e., it contains facts that would be included elsewhere in a report on the topic)
- 0: document is not relevant to the topic (i.e., it does not contain information that would be included in a report written about the topic)

See the Assessment section below for further explanation of these relevance levels. Qrels files will be distributed with development data but not evaluation data.
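As an illustration, a minimal sketch of reading a qrels file in this format into a per-topic dictionary (the file name is a placeholder):

```python
from collections import defaultdict

# Parse a TREC qrels file (topic ID, unused 0, document ID, judgment) into a
# nested dictionary. "qrels.txt" is a placeholder file name.
qrels = defaultdict(dict)
with open("qrels.txt", encoding="utf-8") as f:
    for line in f:
        topic_id, _unused, doc_id, judgment = line.split()
        qrels[topic_id][doc_id] = int(judgment)  # 0, 1, or 3 in this track
```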
Two sets of development data are available. The first, called HC4, includes document sets of about 5 million Russian documents and ½ million each of Chinese and Persian documents. There are 60 development topics each in Chinese and Persian, and 54 development topics in Russian. Many of the HC4 documents are in NeuCLIR1. You can score the development topics against the NeuCLIR1 collection by intersecting the two document sets: throw away retrieved documents that are not in HC4, and filter the qrels to documents that are in NeuCLIR1 (see the sketch below). Access to this filtered version of the corpus, as well as the development topics and qrels, is available automatically through the hc4-filtered datasets in ir_datasets.
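The intersection step can be sketched as follows; the file names for the document ID lists, the run, and the HC4 qrels are placeholders (the hc4-filtered ir_datasets packages already provide this filtering).

```python
# Sketch of the filtering described above, with placeholder file names:
# keep only retrieved documents that appear in HC4, and keep only qrels
# entries whose documents appear in NeuCLIR1.
def read_ids(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f}

hc4_ids = read_ids("hc4_doc_ids.txt")            # placeholder: HC4 document IDs
neuclir1_ids = read_ids("neuclir1_doc_ids.txt")  # placeholder: NeuCLIR1 document IDs

# Drop retrieved documents that are not in HC4 (field 3 of a run line is the doc ID).
with open("run.txt", encoding="utf-8") as fin, \
     open("run.filtered.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        if line.split()[2] in hc4_ids:
            fout.write(line)

# Drop qrels entries for documents that are not in NeuCLIR1 (field 3 is the doc ID).
with open("hc4_qrels.txt", encoding="utf-8") as fin, \
     open("qrels.filtered.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        if line.split()[2] in neuclir1_ids:
            fout.write(line)
```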
The second development set comprises the NeuCLIR 2022 evaluation set. While the HC4 development data can be useful, the NeuCLIR 2022 evaluation set is a better match to the NeuCLIR 2023 evaluation: the document collection is identical, and the topics were developed in the same way. Topics and qrels for this development set are also available through NIST and in ir_datasets. Forty-eight of the fifty topics have relevant documents in more than one language; a list of these topics may be found on the track's webpage.
NeuCLIR 2023 will use the NeuCLIR-1 document collection, the same collection used for NeuCLIR 2022. The document collections are drawn from the Common Crawl News Collection, spanning a five-year window from August 2016 to July 2021. Very short and very long documents were removed, and automatic deduplication removed duplicate documents from all three collections. The Russian collection was randomly sampled to have roughly 5 million documents post-deduplication. There are 4½ million Russian, 3 million Chinese, and 2 million Persian documents. Information on the document collection is provided above.
Evaluation topics will be released June 5, 2023.
Additional CLIR training resources are available through ir_datasets (the neuMARCO data can also be downloaded as neuMSMARCO.tar.gz):

- mMARCO (ir_datasets: mmarco; Chinese: mmarco/zh, mmarco/v2/zh; Russian: mmarco/ru, mmarco/v2/ru)
- neuMARCO (ir_datasets: neumarco; Chinese: neumarco/zh; Persian: neumarco/fa; Russian: neumarco/ru; download: neuMSMARCO.tar.gz)
- WikiCLIR (ir_datasets: wikiclir)
- CLIRMatrix (ir_datasets: clirmatrix; example code available)
- Mr. TyDi (ir_datasets: mr-tydi)
Patapsco is a CLIR framework that makes it easy to get started with CLIR. An overview of Patapsco’s capabilities can be found in this paper, and the code is available from this git repository. A Jupyter notebook that steps through basic Patapsco usage is available here.
The Chinese NeuCLIR collection includes both traditional and simplified characters. For those who would like to work with only one Chinese character set, a script to convert traditional to simplified characters is available here. Users should bear in mind that the conversion from traditional to simplified characters is imperfect, and there is no guarantee that the resulting text accurately captures the meaning of the original traditional-character text.
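If a library is preferable to the provided script, the same conversion can be sketched with the third-party OpenCC package (an assumption on our part; the track's script remains the reference), with the same caveat that the mapping is lossy.

```python
# Convert traditional Chinese to simplified Chinese using the third-party
# OpenCC package (e.g., pip install opencc-python-reimplemented). This is an
# alternative to the script linked above; the same caveat about imperfect
# conversion applies.
from opencc import OpenCC

converter = OpenCC("t2s")  # "t2s" = traditional to simplified
simplified = converter.convert("資訊檢索系統")
print(simplified)
```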
A validator for track submissions will be posted to the Active Participants area on trec.nist.gov.
Submissions will be evaluated using ir-measures, which is available here. The software uses the official trec_eval implementation for measures that trec_eval supports (nDCG, MAP, etc.), and delegates computation of measures unsupported by trec_eval to alternative implementations (e.g., Rank Biased Precision uses cwl_eval). Evaluation will be conducted using the following command:
ir_measures qrels_file run_file 'nDCG@20 MAP RBP(rel=1) R@100 R@1000'
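The same evaluation can also be scripted through the ir-measures Python API; a minimal sketch with placeholder file names:

```python
import ir_measures

# "qrels_file" and "run_file" are placeholders for files in the TREC formats
# described above.
qrels = ir_measures.read_trec_qrels("qrels_file")
run = ir_measures.read_trec_run("run_file")

# The same measures as the command-line invocation above.
measures = [ir_measures.parse_measure(m)
            for m in ["nDCG@20", "MAP", "RBP(rel=1)", "R@100", "R@1000"]]
print(ir_measures.calc_aggregate(measures, qrels, run))
```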
Runs will be submitted through the NIST submission system. Runs that do not pass validation will be rejected outright. For each submitted run, you will be asked to specify (a) the task; (b) document language(s); (c) query language; (d) topic fields; (e) full retrieval or reranking; (f) neural, statistical, or a combination; and (g) automatic (no human intervention) or manual (human intervention at any point).
You may submit an unlimited number of runs, ordered according to which runs you most want to be part of the evaluation pools. The number of runs from each participating group included in the pools, and the depth to which those runs will be examined, will depend on the number and variety of submissions received. At least three submissions from each group will be included in the pools.
Each run submission must indicate whether the run is manual or automatic. An automatic run is any run that receives no human intervention (including your own human translation of the queries). We expect most NeuCLIR runs to be automatic. Note that using the provided human or machine translations without further human intervention counts as an automatic run. Runs that use the provided human translations at any stage of retrieval will be considered monolingual runs and will be reported separately from cross-language runs.
Results on manual runs will be specifically identified when results are reported. A manual run is any run for which changes are made to the queries, the system, or the system's results after the topics have been seen by a person. This includes, for example, manually creating queries from the topic description or based on manual examination of retrieval results, or implementing new automated processing capabilities (such as stop-structure removal) that are created after the topics have been seen. Simple bug fixes that address only format handling do not result in manual runs, but such changes should be described.
Relevance assessments for each topic will be made by a single person per language. MLIR submissions will be separated by language when adding documents to the pools.
Scoring will be performed by ir-measures, as described above. The main reported measure will be Normalized Discounted Cumulative Gain at 20 (nDCG@20). Weights for the three levels of relevance are 0, 1, and 3. Note that there is no relevance score of 2.
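For reference, a common formulation of the metric with these graded gains is shown below; the authoritative definition is the trec_eval implementation invoked by ir-measures.

$$\mathrm{nDCG@20} = \frac{\mathrm{DCG@20}}{\mathrm{IDCG@20}}, \qquad \mathrm{DCG@20} = \sum_{i=1}^{20} \frac{\mathrm{rel}_i}{\log_2(i+1)},$$

where rel_i is the relevance weight (0, 1, or 3) of the document at rank i and IDCG@20 is the DCG@20 of an ideal ordering of the judged documents.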
Additional evaluation measures include MAP, RBP, Recall@100, and Recall@1000.
Multilingual runs will be scored using the same measures by merging the qrels files for the three languages into a single multilingual qrels file. Multilingual runs will also be assessed for language fairness using α-nDCG, with each language treated as an aspect.
We solicit self-reported Mean Response Time (MRT) per query and the total number of model parameters for each run. This will allow us to analyze the efficiency of various approaches. We recognize that neither of these measures allows for a perfectly fair comparison (e.g., MRT is hardware-dependent, and the number of parameters in a model does not correlate directly with its efficiency), so these measurements will only be grouped into coarse-grained categories. Reporting these values is optional, and good-faith approximations are permitted.
Our aim for NeuCLIR is to be a carbon-neutral track. Therefore, we strongly encourage all participants to track carbon emissions associated with their submission, and, if they are able to, buy carbon offsets to minimize the impact of their submissions. For details, please visit this page. Participants will be asked on submission whether they tracked their carbon emissions and whether they offset their impact. However, this will not affect the evaluation of individual runs; the information will be anonymized and will only be reported publicly in aggregate (e.g., “85% of teams tracked their carbon”).
The NeuCLIR session will be held on Wednesday, November 15, 2023 in person at NIST in Gaithersburg, MD, as well as virtually.
The agenda is as follows (all times are in Eastern Standard Time/UTC -5):
Time | Event |
---|---|
1:00 PM - 1:30 PM | CLIR/MLIR Task Introduction (Dawn & Eugene) |
1:30 PM - 2:30 PM | Team Presentations (Various) |
2:30 PM - 2:45 PM | Break |
2:45 PM - 2:55 PM | Tech Task Introduction (Dawn & Eugene) |
2:55 PM - 3:00 PM | Team Presentation (AI2, Luca) |
3:00 PM - 4:00 PM | NeuCLIR 2024 Planning Session (Luca & Jim) |
The schedule for team presentations is as follows:
Time | Team |
---|---|
1:30 PM - 1:40 PM | Waterloo (Carlos Lassance) |
1:40 PM - 1:50 PM | UMass (Zhiqi Huang) |
1:50 PM - 2:05 PM | COE (Eugene Yang) |
2:05 PM - 2:20 PM | ISI (Scott Miller) |
2:20 PM - 2:25 PM | UMD (Suraj Nair) |
To register for TREC (either in person or virtually), please visit the active participants area of the TREC 2023 website.
In alphabetical order: