Data Issues in Bioinformatics

Rajendra Raj
Department of Computer Science, RIT


The structure, storage and efficient retrieval of very large amounts of biological data have been identified as one of the major problems in Bioinformatics. Bioinformatics data poses several interesting problems. First, it has traditionally been text-based and flat, which makes it readable by humans but poorly suited to machine processing. To make the data suitable for machine processing, flat files, relational databases, object databases and XML databases have been used, each of which offers some advantages and some drawbacks. Second, data sets from different sources need to be integrated for analysis, but the highly heterogeneous nature of their database schemas makes data integration challenging. Finally, existing data in many Bioinformatics databases is incomplete, inaccurate and unclean.

With the support of a FEAD grant, I made an initial attempt to understand some of the organization and integration challenges of Bioinformatics data. In this talk, I report on the progress I have made so far.

