Chem-Infty Dataset: A Ground-Truthed Dataset of Chemical Structure Images

Contact Authors:
Koji Nakagawa (kn@kyudai.jp), Faculty of Mathematics, Kyushu University, JAPAN
Akio Fujiyoshi (fujiyosi@mx.ibaraki.ac.jp), Department of Computer and Information Sciences, Ibaraki University, JAPAN
Masakazu Suzuki (suzuki@math.kyushu-u.ac.jp), Faculty of Mathematics, Kyushu University, JAPAN

Keywords:
Optical Chemical Structure Recognition

Description:
This dataset consists of chemical structure images (data) and their chemical meaning (meta data). The 5727 images were randomly collected from Japanese patent applications published in 2008. The meta data are represented in MDL SDF (Structure Data Format), which is one of the CTfile (Chemical Table file) formats.

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.1 Japan License: http://creativecommons.org/licenses/by-nc-nd/2.1/jp/

When you use or distribute this dataset, please send us your contact information (Name, Affiliation, E-mail address) so that we can notify you of updates.

Although we tried our best to make the dataset correct, it may still contain errors. If you find one, please report it to us so that the dataset can be updated.

Technical Information:
- File formats:
  Images: TIFF format images including binary and greyscale.
  Ground Truth Specification: SDF format 
  The specification of the SDF format can be downloaded from here: http://www.symyx.com/solutions/white_papers/ctfile_formats.jsp
- File Name Convention:
 The image files and their meta data files follow this naming convention:
  2008XXXXXX_N_chem.tif: a TIFF file
  2008XXXXXX_N_chem.sdf: the meta data of 2008XXXXXX_N_chem.tif
 The string '2008XXXXXX' is the patent ID, and 'N' indicates the N-th element of the multi-page TIFF file (see Reference 1).
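
 The naming convention above can be used to pair each image with its ground truth. The following Python sketch is only an illustration, not part of the dataset tools. It assumes that each 'X' is a decimal digit (six digits after '2008') and that 'N' is a decimal integer, possibly zero-padded; the function names parse_name and pair_files are hypothetical.

```python
import re

# Assumed pattern: '2008' + six digits, then N as a decimal integer
# (possibly zero-padded), then '_chem' and a .tif or .sdf extension.
NAME_RE = re.compile(r"^(2008\d{6})_(\d+)_chem\.(tif|sdf)$")


def parse_name(filename):
    """Return (patent_id, n, extension), or None if the name does not match."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), m.group(3)


def pair_files(filenames):
    """Group file names by (patent_id, n) so each image sits next to its SDF."""
    pairs = {}
    for name in filenames:
        parsed = parse_name(name)
        if parsed is None:
            continue  # skip files that do not follow the convention
        patent_id, n, ext = parsed
        pairs.setdefault((patent_id, n), {})[ext] = name
    return pairs
```

 Because 'N' is converted to an integer, a zero-padded index and an unpadded one map to the same key. An entry that lacks either the 'tif' or the 'sdf' key points to a missing counterpart file.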
