Researchers at Washington State University have found an alarming number of errors in publicly available genomic data while conducting a large-scale analysis of protein sequences.
The work, published in Frontiers in Microbiology, the most cited microbiology journal in the world, could have important implications for the future of genome research.
An interdisciplinary team of scientists initially set out to find evidence of a minimal set of proteins that Proteobacteria need to survive. Their data set consisted of almost nine million protein sequences, grouped by similarity, from more than 2,300 bacterial genomes.
A genome is the complete set of genes in a cell or organism, and genes provide the instructions for making proteins, which make up all organisms.
As they searched the massive data set for four specific proteins considered part of the minimal genome for Proteobacteria, they found that only one of the four proteins they were looking for was shared by all the bacteria. They also found a large number of errors in the publicly available data.
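The check described above, whether a protein family appears in every genome, can be illustrated with a minimal sketch. This is not the WSU team's pipeline: the `(cluster_id, genome_id)` input format and the function names `conserved_clusters` and `coverage` are assumptions made purely for illustration, and real data would come from a clustering tool's output.

```python
from collections import defaultdict


def genomes_by_cluster(assignments):
    """Map each cluster ID to the set of genomes that contribute a protein to it.

    `assignments` is an iterable of (cluster_id, genome_id) pairs
    (a hypothetical format chosen for this sketch).
    """
    seen = defaultdict(set)
    for cluster_id, genome_id in assignments:
        seen[cluster_id].add(genome_id)
    return seen


def conserved_clusters(assignments, genomes):
    """Return the sorted cluster IDs that contain a protein from every genome."""
    all_genomes = set(genomes)
    seen = genomes_by_cluster(assignments)
    return sorted(c for c, found in seen.items() if found >= all_genomes)


def coverage(assignments, genomes):
    """Return, per cluster, the fraction of genomes represented in it."""
    all_genomes = set(genomes)
    seen = genomes_by_cluster(assignments)
    return {c: len(found & all_genomes) / len(all_genomes) for c, found in seen.items()}


if __name__ == "__main__":
    # Toy data: three genomes, two protein clusters.
    toy = [("c1", "g1"), ("c1", "g2"), ("c1", "g3"), ("c2", "g1"), ("c2", "g3")]
    print(conserved_clusters(toy, {"g1", "g2", "g3"}))
    print(coverage(toy, {"g1", "g2", "g3"}))
```

A cluster whose coverage falls short of 1.0 is not necessarily absent from the missing genomes. As the researchers report, an annotation error can make a protein that is actually present appear to be missing.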
"We found that for each of the proteins there were mistakes in the gene annotation, resulting in reduced consistency or a lack of it," said Shira Broschat, a professor in the School of Electrical Engineering and Computer Science.
The huge amount of data generated by next-generation sequencing technologies makes the type of annotation errors the WSU team found particularly problematic, says Svetlana Lockwood, lead author on the paper and a Ph.D. graduate in computer science from WSU.
"One annotation error can spread rapidly, because scientists build on previous annotations when they sequence a new gene," she said.
While it took 13 years and $2.7 billion to sequence the human genome as part of the Human Genome Project, completed in 2003, the same work can now be done in one hour for less than $1,500.
"In the past two years alone, researchers have sequenced more than twice the number of bacterial genomes as they did in the twenty years before," said Broschat.
Although this is not the first paper to note the presence of annotation errors, the WSU team's work lists and explains the different types of annotation errors now found in genomic sequencing data.
"Given the scale of misannotation we found, researchers need to reconsider the reliability of public genome databases for big data applications," said Broschat.
According to Kelly Brayton, a professor in the department of veterinary microbiology and pathology, the errors are caused by both human and technological factors. They often arise from imperfect DNA sequencing technology, which provides information about the base pairs of DNA segments. They can also arise from confusion and a lack of knowledge about the proteins themselves.
The team used embedded software and high-performance computing clusters on the PNNL campus to work on its data set, the largest of its kind analyzed to date. The data were obtained from a database provided by the National Center for Biotechnology Information, part of the U.S. National Library of Medicine, the world's largest medical library. The work was funded by the National Science Foundation.
Broschat and Brayton are now working on a tool to find annotation errors in biological data sets, which would be very useful for those working in the life sciences.
Svetlana Lockwood et al. Whole Proteome Clustering of 2,307 Proteobacterial Genomes Reveals Conserved Proteins and Significant Annotation Issues, Frontiers in Microbiology (2019). DOI: 10.3389/fmicb.2019.00383
Amid an explosion of genomic data, scientists find proliferating errors (2019, April 30), retrieved April 30, 2019