NeuCLIR


2023 TREC NeuCLIR Track

Version 1.1; 9 May 2023

Access to the 2023 NeuCLIR datasets (documents and queries) requires registration for TREC, which is free and can be started from this webpage.
Topics for this year have been released! You can download them directly from the NIST website. We released 76 topics, all of which are intended to be multilingual. For both CLIR and multilingual submissions, please run and submit results for all topics.

Cross-language Information Retrieval (CLIR) has been studied at TREC and subsequent evaluations for more than twenty years. Prior to the application of deep learning, strong statistical approaches were developed that work well across many languages. As with most other language technologies, though, neural computing has led to significant performance improvements in information retrieval. CLIR has only just begun to incorporate these neural advances.

The TREC 2023 NeuCLIR track presents a cross-language information retrieval challenge. NeuCLIR topics are written in English. NeuCLIR has three target language collections in Chinese, Persian, and Russian. Topics are written in the traditional TREC format: a short title and a sentence-length description. Systems are to return a ranked list of documents for each topic. Results will be pooled, and systems will be evaluated on a range of metrics.


Tasks

The following table lists the five NeuCLIR 2023 tasks, along with the main variants of each:

Collection | Document Language(s) | Query Language | Topic Fields | Full Retrieval (FR) or Reranking (RR)
Single-Language News | fas | eng, fas, other | title, desc, both, other | FR, RR
Single-Language News | rus | eng, rus, other | title, desc, both, other | FR, RR
Single-Language News | zho | eng, zho, other | title, desc, both, other | FR, RR
Multilingual News | fas+rus+zho | eng, fas, rus, zho, other | title, desc, both, other | FR, RR
Single-Language Technical | zho (CSL) | eng, zho, other | title, desc, both, other | FR, RR

Single-Language News Retrieval (CLIR and Monolingual) Tasks

The main task in the NeuCLIR track is ad hoc cross-language news retrieval. Systems will receive a document collection in Chinese, Persian, or Russian, and a set of topics in one of English, Chinese, Persian or Russian. For each topic, the system will return a ranked list of 1000 documents drawn from the document collection, ordered by likelihood of relevance to the topic. To facilitate fair comparisons across reranking approaches, the organizers will provide a strong initial ranking of documents.

Topics will be available in Chinese, Persian and Russian. In addition to CLIR runs with English queries (the main task), we invite submissions from systems in which the query language is the same as the document language (i.e., monolingual runs), and submissions from systems in which the query language is neither English nor the document language (non-English CLIR runs). While queries are provided in multiple languages, we encourage the use of a single query language throughout a given retrieval pipeline; this is especially encouraged in reranking, where queries may be ingested by multiple systems.

Multilingual News Retrieval (MLIR) Task 💥 New in 2023! 💥

This task is identical to the Ad Hoc CLIR task, with the exception that for each query, systems must search all three document collections and produce a single ranked list. That is, systems should treat the entirety of the NeuCLIR-1 document collections across all three languages as a single corpus. The topics for the multilingual retrieval task will be identical to those of the ad hoc CLIR task. Participants should be aware that there is no guarantee that the set of relevant documents for a query will include documents from all three languages. As with the CLIR tasks, the organizers will provide a strong initial ranking to allow fair comparisons across reranking approaches for this task.

Technical Documents (CLIR and Monolingual) Pilot Task 💥 New in 2023! 💥

Deadline: August 14, 2023

Domain-specific texts can exhibit writing styles and vocabulary that are difficult for machine translation or multilingual embeddings to handle. The technical document pilot task will examine the feasibility of an ad hoc CLIR task over documents and topics drawn from a variety of technical domains. The task will ask systems to retrieve Chinese technical documents (specifically, abstracts of academic papers and theses) using English queries. The task is identical to the ad hoc news retrieval CLIR task except for the domain and the size of the document collection. In addition to English topics, topics translated into Chinese will be available; submission of monolingual runs using these translations is invited. We also ask that all participants in the Chinese news CLIR task submit one or more baseline technical document runs using their news retrieval system. As with the CLIR tasks, the organizers will provide a strong initial ranking to allow fair comparisons across reranking approaches for this task. Because this is a pilot task, the number of topics will be limited. Topics will be released after the submission deadline of the other tasks.

Back to top


Formats

Documents

Documents are distributed in JSONL format, with one document in JSON format on each line. The fields present for each document are:

The documents can be downloaded from Huggingface Datasets and can be used by various toolkits (including Patapsco) via the ir_datasets integration: neuclir.
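For a quick look at the documents programmatically, here is a minimal sketch in Python; the JSONL path and the ir_datasets identifier ("neuclir/1/zh") are assumptions on our part, so check the neuclir entries in ir_datasets for the exact names:

import json

import ir_datasets  # pip install ir-datasets

# Option 1: plain JSONL, one JSON document per line (path is a placeholder).
with open("neuclir1-zho-docs.jsonl", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        print(sorted(doc.keys()))  # list the fields present for this document
        break

# Option 2: through the ir_datasets integration (dataset id assumed).
dataset = ir_datasets.load("neuclir/1/zh")
for doc in dataset.docs_iter():
    print(doc.doc_id)
    break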

Participating teams can get a copy of the processed corpora directly from NIST.

The technical documents are an adapted version of the CSL dataset. They can be downloaded from Huggingface Datasets and can be used by various toolkits (including Patapsco).

Ranked Results (Submissions and Reranking Inputs)

NeuCLIR will use the standard TREC ad hoc submission format for submissions and for the baseline ranked lists that serve as input to the reranking task. Each set of ranked results for a set of topics appears in a single file. Each line of this file contains six whitespace-separated entries:

  1. Topic (query) number
  2. The fixed string “Q0”
  3. Document ID
  4. Rank
  5. Score (integer or float)
  6. Run ID

Field 1 is the topic ID taken from the topics file. Entries for each topic must be contiguous in a ranked results file. Field 3 is the document ID, taken from the document collection. The scores in Field 5 must appear in non-increasing order within a given topic. In the reranking input files, these values will be non-increasing but will otherwise not be meaningful. Note that the TREC evaluation software orders entries with the same score lexicographically by document ID, not by the order in which they appear in the submission file. Field 6 is the run ID, generated by the submitter (see below). Fields 2 and 4 are ignored, although they must be present.

Here is a portion of a sample ranked results file:

1 Q0 pid1 1 2.73 zho-team1-ACLN_run1
1 Q0 pid2 2 2.71 zho-team1-ACLN_run1
1 Q0 pid3 3 2.61 zho-team1-ACLN_run1
1 Q0 pid4 4 2.05 zho-team1-ACLN_run1
1 Q0 pid5 5 1.89 zho-team1-ACLN_run1

Up to 1,000 results per query will be accepted. Runs containing queries with more than 1,000 results will be truncated.
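As a minimal sketch (the run ID, topic IDs, and document IDs below are placeholders), the following Python snippet writes results in this six-column format, sorts each topic's results by descending score, and truncates to 1,000 entries per topic:

def write_run(results, run_id, path, max_per_topic=1000):
    # results: dict mapping topic_id -> list of (doc_id, score) pairs
    with open(path, "w", encoding="utf-8") as out:
        for topic_id, scored in results.items():
            # Scores must be non-increasing within a topic; anything beyond
            # 1,000 results per topic would be truncated anyway.
            ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
            for rank, (doc_id, score) in enumerate(ranked[:max_per_topic], start=1):
                out.write(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_id}\n")

write_run({"1": [("pid1", 2.73), ("pid2", 2.71)]}, "zho-team1-ACLN_run1", "run.txt")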

Run Names

Run names must begin with a string identifying the source document collection or task, followed by a dash, selected from the following options:

The second field of each run name must be your registered team name, again followed by a dash.

To better categorize your runs, we recommend adding a third field consisting of all of the following characters that apply to the run, followed by an underscore:

So for example, mlir-TEAM3-DEAT_MyAmazingRun represents an automatic multilingual run over machine-translated documents using English queries with dense retrieval, tech-TEAM3-TN_MyNiftyRun represents a technical document task run using both native and machine-translated documents, fas-TEAM3-2N_MyUglyRun represents a reranking run over native Persian language documents, and rus-TEAM3-MyOutrageousRun represents a run with no further information. The informational third field is recommended but optional; if present, it will be used to ensure the run is properly categorized in the track results.

Run names may follow the fields listed above with any desired text.
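As a quick sanity check, the sketch below matches run names against this convention. The set of collection/task prefixes (fas, rus, zho, mlir, tech) is inferred from the task table and the examples above, and the optional third field is treated as a free-form alphanumeric string, so this is an illustrative approximation rather than the official validator:

import re

RUN_NAME = re.compile(
    r"^(fas|rus|zho|mlir|tech)-"   # collection/task prefix and dash
    r"[A-Za-z0-9]+-"               # registered team name and dash
    r"([A-Za-z0-9]+_)?"            # optional categorization field + underscore
    r".+$"                         # free-form run description
)

for name in ["mlir-TEAM3-DEAT_MyAmazingRun", "rus-TEAM3-MyOutrageousRun", "badname"]:
    print(name, bool(RUN_NAME.match(name)))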

QRELS

Relevance judgments are in standard TREC QRELS format. Each line of the QRELS file will contain four whitespace-separated entries:

  1. A topic ID
  2. The string 0 (zero)
  3. A document ID
  4. A relevance judgment drawn from the following set:
    • 3: document is fully relevant to the topic (i.e., it contains facts that would be included in the lead paragraph of a report on the topic)
    • 2: not used (no documents receive a judgment of 2)
    • 1: document is somewhat relevant to the topic (i.e., it contains facts that would be included elsewhere in a report on the topic)
    • 0: document is not relevant to topic (i.e., it does not contain information that would be included in a report written about the topic)

See the Assessment section below for further explanation of these relevance levels. Qrels files will be distributed with development data but not evaluation data.
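A minimal sketch for reading a qrels file of this form into Python (the file name is a placeholder):

from collections import defaultdict

qrels = defaultdict(dict)  # topic_id -> {doc_id: judgment}
with open("dev.qrels", encoding="utf-8") as f:
    for line in f:
        topic_id, _zero, doc_id, judgment = line.split()
        qrels[topic_id][doc_id] = int(judgment)

# For recall-style measures, judgments of 1 or 3 count as relevant
# (no document receives a judgment of 2).
relevant = {t: {d for d, j in docs.items() if j >= 1} for t, docs in qrels.items()}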

Back to top


Data

Development Data

Two sets of development data are available. The first, called HC4, includes document sets of about 5 million Russian documents and ½ million each of Chinese and Persian documents. There are 60 development topics each in Chinese and Persian, and 54 development topics in Russian. Many of the HC4 documents are also in NeuCLIR-1. You can score the development topics against the NeuCLIR-1 collection by intersecting the two document sets: discard retrieved documents that are not in HC4, and filter the qrels to documents that appear in NeuCLIR-1. This filtered version of the corpus, along with the development topics and qrels, is available automatically through the hc4-filtered datasets in ir_datasets.
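A minimal sketch of loading this filtered development data; the exact dataset identifier ("neuclir/1/zh/hc4-filtered" for Chinese here) and field names are assumptions, so check the neuclir entries in ir_datasets for the authoritative names per language:

import ir_datasets

dataset = ir_datasets.load("neuclir/1/zh/hc4-filtered")

for query in dataset.queries_iter():   # development topics
    print(query.query_id)
    break
for qrel in dataset.qrels_iter():      # development judgments
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break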

The second development set comprises the NeuCLIR 2022 evaluation set. While the HC4 development data can be useful, the NeuCLIR 2022 evaluation set is a better match to the NeuCLIR 2023 evaluation: the document collection is identical, and the topics were developed in the same way. Topics and qrels for this development set are also available through NIST and in ir_datasets. Forty-eight of the fifty topics have relevant documents in more than one language; a list of these topics may be found on the track's webpage.

Evaluation Data

NeuCLIR 2023 will use the NeuCLIR-1 document collection, the same collection used for NeuCLIR 2022. The document collections are drawn from the Common Crawl News Collection and span a five-year window from August 2016 to July 2021. Very short and very long documents were removed, and automatic deduplication removed duplicate documents from all three collections. The Russian collection was randomly sampled to roughly 5 million documents after deduplication. There are 4½ million Russian, 3 million Chinese, and 2 million Persian documents. Information on the document collection is provided above.

Evaluation topics will be released June 5, 2023.

Additional Resources

General

Chinese

Persian

Russian

Back to top


Software

Patapsco

Patapsco is a CLIR framework that makes it easy to get started with CLIR. An overview of Patapsco’s capabilities can be found in this paper, and the code is available from this git repository. A Jupyter notebook that steps through basic Patapsco usage is available here.

Chinese Character Mapping

The Chinese NeuCLIR collection includes both traditional and simplified characters. For those who would like to work with only one Chinese character set, a script to convert traditional to simplified characters is available here. Users should bear in mind that the conversion from traditional to simplified characters is imperfect, and there is no guarantee that the resulting text accurately captures the meaning of the text in the original traditional characters.
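If you prefer an off-the-shelf library instead of the linked script, the sketch below uses OpenCC for the same traditional-to-simplified conversion; this is an alternative we suggest, not the track's own script, and it carries the same caveat about imperfect mappings:

from opencc import OpenCC  # pip install opencc (or opencc-python-reimplemented)

t2s = OpenCC("t2s")  # traditional-to-simplified conversion profile
print(t2s.convert("資訊檢索"))  # character-level conversion; prints 资讯检索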

Validator

A validator for track submissions will be posted to the Active Participants area on trec.nist.gov.

ir-measures

Submissions will be evaluated using ir-measures, which is available here. The software uses the official trec_eval implementation for measures that trec_eval supports (nDCG, MAP, etc.), and delegates computation of measures unsupported by trec_eval to alternative implementations (e.g., Rank Biased Precision uses cwl_eval). Evaluation will be conducted using the following command:

ir_measures qrels_file run_file 'nDCG@20 MAP RBP(rel=1) R@100 R@1000'
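The same evaluation can be run through the ir_measures Python API; a minimal sketch with placeholder file names (AP is the package's name for MAP):

import ir_measures
from ir_measures import nDCG, AP, RBP, R

qrels = ir_measures.read_trec_qrels("qrels_file")
run = ir_measures.read_trec_run("run_file")

measures = [nDCG @ 20, AP, RBP(rel=1), R @ 100, R @ 1000]
print(ir_measures.calc_aggregate(measures, qrels, run))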

Back to top


Submission

Runs will be submitted through the NIST submission system. Runs that do not pass validation will be rejected outright. At submission time, you will be asked to specify (a) task; (b) document language(s); (c) query language; (d) topic fields; (e) full retrieval or reranking; (f) neural, statistical, or a combination; and (g) automatic (no human intervention) or manual (human intervention at any point).

You may submit an unlimited number of runs, ordered according to which runs you most want to be part of the evaluation pools. The number of runs from each participating group included in the pools, and the depth to which those runs will be examined, will depend on the number and variety of submissions received. At least three submissions from each group will be included in the pools.

Automatic Runs

Each run submission must indicate whether the run is manual or automatic. An automatic run is any run that receives no human intervention (translating the queries yourself, for example, counts as human intervention). We expect most NeuCLIR runs to be automatic. Note that using the provided human and machine translations without further human intervention counts as automatic. Runs that use the provided human translations at any stage of retrieval will be considered monolingual runs and will be reported separately from cross-language runs.

Manual Runs

Results on manual runs will be specifically identified when results are reported. A manual run is any run for which changes are made to the queries, the system, or the system's results after the topics have been seen by a person. This includes, for example, manually creating queries from the topic description or from manual examination of retrieval results, or implementing new automated processing capabilities (such as stop-structure removal) that are created after the topics have been seen. Simple bug fixes that address only format handling do not result in manual runs, but such changes should be described.

Assessment & Scoring

Relevance assessments for each topic will be made by a single person per language. MLIR submissions will be separated by language when adding documents to the pools.

Scoring will be performed by ir-measures, as described above. The main reported measure will be Normalized Discounted Cumulative Gain at 20 (nDCG@20). Weights for the three levels of relevance are:

Note that there is no relevance score of 2.

Additional evaluation measures include MAP, RBP, Recall@100, and Recall@1000.

Multilingual runs will be scored using the same measures by merging the qrels files for the three languages into a single multilingual qrels file. Multilingual runs will also be assessed for language fairness using α-nDCG, with each language treated as an aspect.
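For reference, merging the per-language qrels into a single multilingual qrels file is a simple concatenation; a minimal sketch with placeholder file names:

# Concatenate the three per-language qrels files into one multilingual file.
with open("mlir.qrels", "w", encoding="utf-8") as out:
    for path in ["fas.qrels", "rus.qrels", "zho.qrels"]:
        with open(path, encoding="utf-8") as f:
            for line in f:
                out.write(line.rstrip("\n") + "\n")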

We solicit self-reported Mean Response Time (MRT) per query and the total number of model parameters for each run. This will allow us to analyze the efficiency of various approaches. We recognize that neither of these measures allows for a perfectly fair comparison (e.g., MRT is hardware-dependent, and the number of parameters in a model does not correlate directly with its efficiency), so these measurements will only be grouped into coarse-grained categories. Reporting these values is optional, and good-faith approximations are permitted.

Back to top


Carbon Neutrality

Our aim for NeuCLIR is to be a carbon-neutral track. Therefore, we strongly encourage all participants to track carbon emissions associated with their submission, and, if they are able to, buy carbon offsets to minimize the impact of their submissions. For details, please visit this page. Participants will be asked on submission whether they tracked their carbon emissions and whether they offset their impact. However, this will not affect the evaluation of individual runs; the information will be anonymized and will only be reported publicly in aggregate (e.g., “85% of teams tracked their carbon”).

Back to top


Registration

Important Dates

Back to top

TREC 2023 Agenda

The NeuCLIR session will be held on Wednesday, November 15, 2023 in person at NIST in Gaithersburg, MD, as well as virtually.

The agenda is as follows (all times are in Eastern Standard Time/UTC -5):

Time Event
1:00 PM - 1:30 PM CLIR/MLIR Task Introduction (Dawn & Eugene)
1:30 PM - 2:30 PM Team Presentations (Various)
2:30 PM - 2:45 PM Break
2:45 PM - 2:55 PM Tech Task Introduction (Dawn & Eugene)
2:55 PM - 3:00 PM Team Presentation (AI2, Luca)
3:00 PM - 4:00 PM NeuCLIR 2024 Planning Session (Luca & Jim)

The schedule for team presentations is as follows:

Time Team
1:30 PM - 1:40 PM Waterloo (Carlos Lassance)
1:40 PM - 1:50 PM UMass (Zhiqi Huang)
1:50 PM - 2:05 PM COE (Eugene Yang)
2:05 PM - 2:20 PM ISI (Scott Miller)
2:20 PM - 2:25 PM UMD (Suraj Nair)

To register for TREC (either in person or virtually), please visit the active participants area of the TREC 2023 website.

Back to top


Organizers

In alphabetical order:

Back to top


Contact

Back to top