Biomedical multilingual corpora

The biomedical corpora include:

  1. A comparable corpus of monolingual texts (CS and EN indexed biomedical data),
  2. Some parallel corpora (aligned data).

The first one (1) is a dataset of monolingual documents in several European languages, indexed with terms from existing multilingual medical taxonomies and vocabularies (such as MeSH and other sources within UMLS). For now, the following Czech medical documents are indexed:

  • Bibliographia Medica Čechoslovaca (BMČ)
  • MeSH in Czech

The following English medical documents are indexed:

  • ClinicalTrials.gov
  • Cochrane
  • drug information web sites
  • DrugBank
  • Genetics Home Reference
  • HON classified diabetes web sites
  • ImageCLEF 2010
  • MEDLINE abstracts
  • UMLS

These corpora are used in the subsequent tasks (4.1b, 4.2, etc.) and also for the development of language models and translation systems.
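
As a rough illustration of what indexing documents with shared vocabulary terms means, the Python sketch below matches text against a toy multilingual vocabulary. Everything in it is an assumption made for illustration: the concept IDs (C001, C002), the labels, the sample sentences, and the simple whole-word matching are hypothetical and do not reflect the actual indexing pipeline or the real MeSH/UMLS identifiers.

```python
import re

# Toy vocabulary (hypothetical): made-up concept IDs mapped to labels per language.
VOCAB = {
    "C001": {"en": ["diabetes mellitus", "diabetes"],
             "cs": ["diabetes mellitus", "cukrovka"]},
    "C002": {"en": ["hypertension"],
             "cs": ["hypertenze"]},
}


def index_document(text, lang):
    """Return the sorted concept IDs whose labels occur in the text."""
    lowered = text.lower()
    found = set()
    for concept_id, labels in VOCAB.items():
        for label in labels.get(lang, []):
            # Whole-word match of the label in the lowercased text.
            if re.search(r"\b" + re.escape(label) + r"\b", lowered):
                found.add(concept_id)
                break
    return sorted(found)


# Hypothetical sample documents in English and Czech.
en_doc = "Hypertension is common in patients with diabetes."
cs_doc = "Cukrovka a hypertenze se často vyskytují společně."

print(index_document(en_doc, "en"))
print(index_document(cs_doc, "cs"))
```

Because both sample documents map to the same language-independent concept IDs, they can be compared across languages even though neither is a translation of the other, which is the sense in which the monolingual texts form a comparable corpus.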

The second one (2) contains automatically aligned parallel corpora of biomedical documents, intended to improve machine translation systems.

Language pairs: EN-FR, EN-CS, EN-DE.

Resources for other languages will be added.

See Task 4.1a Multi-lingual biomedical information extraction and indexing, Task 4.1b Applying other language models to information extraction, Task 4.2 Alignment for parallel corpora of biomedical documents, and Task 4.6 Machine translation.