Cross-language Information Retrieval (CLIR) has been studied at TREC and subsequent evaluations for more than twenty years. Prior to the application of deep learning, strong statistical approaches were developed that work well across many languages. As with most other language technologies though, neural computing has led to significant performance improvements in information retrieval. CLIR has just begun to incorporate neural advances.
The TREC 2022 NeuCLIR track presents a cross-language information retrieval challenge. NeuCLIR topics are written in English. NeuCLIR has three target language collections in Chinese, Persian, and Russian. Topics are written in the traditional TREC format: a short title and a sentence-length description. Systems are to return a ranked list of documents for each topic. Results will be pooled, and systems will be evaluated on a range of metrics.
The main task in the NeuCLIR track is ad hoc cross-language information retrieval. Systems will receive a document collection in Chinese, Persian, or Russian, and a set of English topics. For each topic, the system will return a ranked list of 1000 documents drawn from the document collection, ordered by likelihood of relevance to the topic.
The reranking task is an extension of the ad hoc task. In addition to a document collection and a set of topics, systems also receive as input for each topic a ranked list of 1000 documents drawn from the document collection. Each ranked list is the output of a retrieval system operating on the ad hoc task. Reranking task results must be drawn only from the documents that appear in these lists. In all other ways, the reranking task is identical to the ad hoc task. We will release runs for reranking for the HC4 training topics.
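The constraint that reranking results come only from the supplied list can be sketched as follows. This is a minimal illustration rather than track-provided code: the function name `rerank` and the `score_fn` callback are hypothetical, and the callback stands in for whatever scoring model a participant actually uses.

```python
from typing import Callable, List, Tuple


def rerank(
    candidates: List[str],
    score_fn: Callable[[str], float],
    max_results: int = 1000,
) -> List[Tuple[str, float]]:
    """Re-score only the supplied candidate document IDs and sort by score.

    `score_fn` is a hypothetical stand-in for a participant's scoring model.
    """
    # Deduplicate while preserving input order, so each document appears once.
    unique_ids = list(dict.fromkeys(candidates))
    scored = [(doc_id, score_fn(doc_id)) for doc_id in unique_ids]
    # Highest score first; the sort is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_results]
```

Because the output is built only from `candidates`, no document outside the input ranked list can appear in the reranked result.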
While monolingual retrieval is not a focus of the NeuCLIR track, monolingual runs can improve assessment pools and serve as good points of reference for cross-language runs. The monolingual retrieval task is identical to the ad hoc task, but uses topic files that are human translations of the English topics into the target language, phrased as speakers of that language would naturally express them.
Documents are distributed in JSONL format, with one document in JSON format on each line. The fields present for each document are:
id(string): The document ID for the document
text(string): Complete text of the document
date(string): YYYY-MM-DD or empty string
Lang(string): ISO 639-3 language code of the document
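Reading a collection in this layout might look like the sketch below. It assumes only what is stated above (one JSON object per line); the helper name `iter_documents` and the file path in the usage comment are placeholders, not part of any track tooling.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one document dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Usage with a file (the path is a placeholder):
# with open("docs.jsonl", encoding="utf-8") as f:
#     for doc in iter_documents(f):
#         print(doc["id"], len(doc["text"]))
```

Streaming line by line avoids loading a multi-million-document collection into memory at once.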
Participating teams can get a copy of the processed corpora directly from NIST. Alternatively, anybody can replicate the corpora by downloading them from the Common Crawl using provided scripts (this takes longer than downloading from NIST).
NeuCLIR will use the standard TREC ad hoc submission format for submissions and for ranked lists that serve as input to the reranking task. Each set of ranked results for a set of topics appears in a single file. Each line of this file contains six whitespace-separated entries:
Field 1 is the topic ID taken from the topics file. Entries for each topic must be contiguous in a ranked results file. Field 3 is the document ID, taken from the document collection. The scores in Field 5 must appear in non-increasing order within a given topic. In reranking input files, these values will be non-increasing, but will otherwise not be meaningful. Field 6 is the run ID, generated by the submitter (the first characters of the run ID must be the name of the submitting team). Fields 2 and 4 are ignored, although they must be present. Here is a portion of a sample ranked results file:
1 Q0 pid1 1 2.73 team1-run1
1 Q0 pid2 2 2.71 team1-run1
1 Q0 pid3 3 2.61 team1-run1
1 Q0 pid4 4 2.05 team1-run1
1 Q0 pid5 5 1.89 team1-run1
Up to 1,000 results per query will be accepted. Runs containing queries with more than 1,000 results will be rejected.
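The format rules above (six fields, contiguous topics, non-increasing scores, at most 1,000 results per topic) can be checked locally with a short script such as the sketch below. It is not the official validator and may differ from it in detail; the function name `check_run` and the wording of its messages are illustrative only.

```python
from typing import Iterable, List


def check_run(lines: Iterable[str], max_per_topic: int = 1000) -> List[str]:
    """Return a list of problems found in a TREC-format ranked results file."""
    problems: List[str] = []
    finished_topics = set()  # topics whose block of lines has already ended
    current_topic = None
    count = 0
    last_score = None
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        fields = line.split()
        if len(fields) != 6:
            problems.append(f"line {n}: expected 6 fields, found {len(fields)}")
            continue
        # Fields 2 and 4 are ignored but must be present.
        topic, _field2, _doc_id, _field4, score_text, _run_id = fields
        try:
            score = float(score_text)
        except ValueError:
            problems.append(f"line {n}: score {score_text!r} is not a number")
            continue
        if topic != current_topic:
            if current_topic is not None:
                finished_topics.add(current_topic)
            if topic in finished_topics:
                problems.append(f"line {n}: topic {topic} is not contiguous")
            current_topic, count, last_score = topic, 0, None
        count += 1
        if count == max_per_topic + 1:
            problems.append(
                f"line {n}: topic {topic} has more than {max_per_topic} results"
            )
        if last_score is not None and score > last_score:
            problems.append(f"line {n}: score increases within topic {topic}")
        last_score = score
    return problems
```

An empty list means the sketch found nothing wrong; the validator posted by NIST remains the authority on whether a run is accepted.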
Relevance judgments are in standard TREC QRELS format. Each line of the QRELS file will contain four whitespace-separated entries:
3: document is fully relevant to the topic (i.e., it contains facts that would be included in the lead paragraph of a report on the topic)
1: document is somewhat relevant to the topic (i.e., it contains facts that would be included elsewhere in a report on the topic)
0: document is not relevant to topic (i.e., it does not contain information that would be included in a report written about the topic)
See the assessment section below for further explanation of these relevance levels. Qrels files will be distributed with development data but not evaluation data.
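A qrels file can be loaded into a nested dictionary with a few lines of code. The sketch below assumes the conventional TREC layout for the four entries (topic ID, an ignored iteration field, document ID, relevance level); that ordering is an assumption based on the standard format rather than something spelled out above, and `read_qrels` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable


def read_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map topic ID -> {document ID: relevance level}."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for n, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"line {n}: expected 4 fields, found {len(fields)}")
        # Assumed order: topic, iteration (ignored), document ID, relevance.
        topic, _iteration, doc_id, relevance = fields
        qrels[topic][doc_id] = int(relevance)
    return dict(qrels)
```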
Development data, called HC4, include document sets of about 5 million Russian documents and ½ million each of Chinese and Persian documents. There are 60 development topics each in Chinese and Persian, and 54 development topics in Russian. Many of the HC4 documents are in NeuCLIR1. You will be able to score the development topics against the NeuCLIR1 collection if you intersect the two document sets, throw away retrieved documents not in HC4 and filter the qrels for documents in NeuCLIR1. HC4 is available from the ir_datasets package under hc4.
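The filtering described above might be sketched as follows. The sketch assumes that document IDs are directly comparable between HC4 and NeuCLIR1 and that the two ID sets have already been collected; the function name and the in-memory data layout are illustrative, not part of any released tooling.

```python
from typing import Dict, List, Set, Tuple

Run = Dict[str, List[Tuple[str, float]]]  # topic -> ranked (doc ID, score)
Qrels = Dict[str, Dict[str, int]]         # topic -> {doc ID: relevance}


def filter_for_dev_scoring(
    run: Run, qrels: Qrels, hc4_ids: Set[str], neuclir1_ids: Set[str]
) -> Tuple[Run, Qrels]:
    """Restrict a NeuCLIR1 run and the HC4 qrels to documents in both sets."""
    shared = hc4_ids & neuclir1_ids
    # Throw away retrieved documents that are not in HC4 (rank order is kept).
    filtered_run = {
        topic: [(doc_id, score) for doc_id, score in ranked if doc_id in shared]
        for topic, ranked in run.items()
    }
    # Keep only judgments for documents that exist in NeuCLIR1.
    filtered_qrels = {
        topic: {d: rel for d, rel in judged.items() if d in shared}
        for topic, judged in qrels.items()
    }
    return filtered_run, filtered_qrels
```

Scores computed this way are only comparable with other results filtered in the same manner.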
NEW: Details about our document collection, NEUCLIR1, are available on this page.
The document collections are drawn from the Common Crawl News Collection, spanning a five-year time window from August 2016 to July 2021. Very short and very long documents were filtered out. The Russian collection was randomly sampled down to 5 million documents before deduplication. Automatic deduplication then removed duplicate documents. There are 4 ½ million Russian, 3 million Chinese, and 2 million Persian documents. Evaluation topics will be released in June 2022.
The download script for the collection is currently available in the NeuCLIR/download-collection repository. A gzipped version of the collection is available here (password will be available on the TREC website).
Patapsco is a CLIR framework that makes it easy to get started with CLIR. An overview of Patapsco’s capabilities can be found in this paper, and the code is available from the hltcoe/patapsco repository. A Jupyter notebook that steps through basic Patapsco usage is available here.
The Chinese NeuCLIR collection includes both traditional and simplified characters. For those who would like to deal with only one Chinese character set, a script to convert traditional to simplified characters is available in the NeuCLIR/download-collection repository. Users should bear in mind that the conversion from traditional to simplified characters is imperfect, and there is no guarantee that the resulting text accurately captures the meaning of the text in the original traditional characters.
A validator for track submissions will be posted to the Active Participants area on trec.nist.gov.
Submissions will be evaluated using ir-measures. The software uses the official trec_eval implementation for measures that trec_eval supports (nDCG, MAP, etc.), and delegates computation of measures unsupported by trec_eval to alternative implementations (e.g., Rank Biased Precision uses cwl_eval). Evaluation will be conducted using the following command:
ir_measures qrels_file run_file \
    'nDCG@20 MAP RBP(rel=1) R@100 R@1000'
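To make two of these measures concrete, the sketch below computes recall at a cutoff and Rank Biased Precision for a single topic in plain Python. It is illustrative only and does not replace ir-measures for official scoring. The persistence value `p=0.8` is an assumption about the default, and `min_rel=1` mirrors the `rel=1` setting in the command by treating any judgment of 1 or higher as relevant.

```python
from typing import Dict, List


def recall_at_k(
    ranked_ids: List[str], judged: Dict[str, int], k: int, min_rel: int = 1
) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = {d for d, rel in judged.items() if rel >= min_rel}
    if not relevant:
        return 0.0
    hits = sum(1 for d in ranked_ids[:k] if d in relevant)
    return hits / len(relevant)


def rbp(
    ranked_ids: List[str], judged: Dict[str, int], p: float = 0.8, min_rel: int = 1
) -> float:
    """Rank Biased Precision with binary gain and persistence p (assumed 0.8)."""
    total = 0.0
    for i, doc_id in enumerate(ranked_ids):  # i = 0 corresponds to rank 1
        if judged.get(doc_id, 0) >= min_rel:
            total += p ** i
    return (1 - p) * total
```

Unjudged documents are treated as not relevant in both functions, which is the usual convention for pooled evaluations.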
Runs will be submitted through the NIST submission system at trec.nist.gov/act_part/act_part.html. Runs that do not pass validation will be rejected outright. Submitted runs will be asked to specify (a) single-stage or multi-stage (i.e., ranking or reranking); (b) neural, statistical, or a combination; and (c) manual or automatic.
Each run submission must indicate whether the run is manual or automatic. An automatic run is any run that receives no human intervention once the system is started and provided with the task inputs. We expect most NeuCLIR runs to be automatic runs. Note that using the provided human and machine translations without further human intervention counts as an automatic run. Runs that use the provided human translations (i.e., the monolingual task) will be reported separately from cross-language runs.
Results on manual runs will be specifically identified when results are reported. A manual run is any run in which changes are made to the queries, the system, or the system’s results after the topics have been seen. This includes, for example, manually creating queries from the topic description or based on manual examination of retrieval results, or implementing new automated processing capabilities, such as stop structure removal, that are created after the topics have been seen. Simple bug fixes that address only format handling do not result in manual runs, but the changes should be described.
Relevance assessments for each topic will be made by a single person, who would optimally be the person who created the topic.
Scoring will be performed by ir-measures. The main reported measure will be:
We also solicit self-reported mean response time per query and the total number of model parameters for each run. This will allow us to analyze the efficiency of various approaches. We recognize that neither of these measures allows for a perfectly fair comparison (e.g., MRT is hardware-dependent and the number of parameters in a model does not correlate directly with its efficiency), so these measurements will only be grouped into coarse-grained categories. Reporting these values is optional, and good-faith approximations are permitted.
Our aim for NeuCLIR is to be a carbon-neutral track. Therefore, we strongly encourage all participants to track carbon emissions associated with their submission, and, if they are able to, buy carbon offsets to minimize the impact of their submissions. For details, please visit this page. Participants will be asked on submission whether they tracked their carbon emissions and whether they offset their impact. However, this will not affect the evaluation of individual runs; the information will be anonymized and will only be reported publicly in aggregate (e.g., “85% of teams tracked their carbon”).