
ARCH dataset

We present ARCH, a computational pathology (CP) multiple instance captioning dataset to facilitate dense supervision of CP tasks. Existing CP datasets focus on narrow tasks; ARCH, on the other hand, contains dense diagnostic and morphological descriptions for a range of stains, tissue types and pathologies. Using intrinsic dimensionality estimation, we show that ARCH is the only CP dataset to (ARCH-)rival its computer vision analog MS-COCO Captions. We conjecture that an encoder pre-trained on dense image captions learns transferable representations for most CP tasks. We support the conjecture with evidence that the ARCH representation transfers to a variety of pathology sub-tasks better than ImageNet features or representations obtained via self-supervised or multi-task learning on pathology images alone. We release our best model and invite other researchers to test it on their CP tasks.


  title={Multiple Instance Captioning: Learning Representations from Histopathology Textbooks and Articles},
  author={Gamper, Jevgenij and Rajpoot, Nasir},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  year={2021}
}

Link to the CVPR 2021 paper

Dataset Usage Rules

  1. The dataset provided here is for research purposes only. Commercial use is not permitted. The data is released under the following license:

    Attribution-NonCommercial-ShareAlike 4.0 International


  2. If you intend to publish research work that uses this dataset, you must cite our paper (given above), in which the dataset was first used.


Please download the dataset from these links: book_set; pubmed_set


Note: due to an error, the number of samples reported in the paper differs from the number of samples in the dataset available for download.