The Long Story Short - DNA Data Storage: A Living Memory
Soon, you might have your favorite video or family photo stored in living DNA instead of on magnetic tape or a silicon chip.
(Note: related companies are listed at the base of the story, and all sources are listed in the "Balanced" section.)
At present, the world is generating data much faster than it can store it. The digital data universe is made up of things like cellphone images and videos uploaded to YouTube, health records stored in hospital cloud systems, banking data accessed from ATMs, and many other forms of digital data. It is doubling in size every two years, according to a recent study by International Data Corp., a Massachusetts-based market intelligence company. Current technologies such as magnetic tape and silicon chips are not expected to be enough to store this constantly expanding digital data universe.
There is a pressing need for a new data storage solution that is denser, cheaper, and more stable. Because digital data is generated much faster than it can be stored, the world is running a data storage deficit estimated at 7.5 zettabytes, or 7.5 trillion gigabytes. This deficit corresponds to approximately 2 billion large hard drives' worth of data, or 260 billion iPhones. Pundits estimate that in 2025 the world will generate 160 zettabytes of digital data. Experts believe the deficit will continue to grow until a new technological solution emerges, one that is much needed because the supply of silicon chips could be exhausted by 2040.
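The unit conversions behind these deficit figures can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, and the function names are illustrative only. It also shows the per-device capacity that the hard-drive and iPhone comparisons imply.

```python
# Unit conversions behind the storage-deficit figures quoted above.
# Assumption: decimal (SI) units, i.e. 1 ZB = 10**21 bytes and 1 GB = 10**9 bytes.
ZETTABYTE = 10**21
GIGABYTE = 10**9


def zettabytes_to_gigabytes(zb: float) -> float:
    """Convert zettabytes to gigabytes."""
    return zb * (ZETTABYTE // GIGABYTE)


def implied_capacity_gb(total_zb: float, device_count: float) -> float:
    """Capacity per device (in GB) if total_zb were spread over device_count devices."""
    return zettabytes_to_gigabytes(total_zb) / device_count


deficit_zb = 7.5
print(f"Deficit: {zettabytes_to_gigabytes(deficit_zb):.2e} GB")
print(f"Implied hard drive size: {implied_capacity_gb(deficit_zb, 2e9):.0f} GB")
print(f"Implied iPhone size: {implied_capacity_gb(deficit_zb, 260e9):.1f} GB")
```

Under these assumptions, the comparisons correspond to hard drives of a few terabytes each and phones of a few tens of gigabytes each.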
The potential use of DNA to store digital data is very appealing, and scientists have long touted DNA as an ideal storage medium for digital information. The question is whether DNA can replace magnetic tape and silicon chips as the preferred data storage system. DNA is dense (a single gram of DNA can hold roughly a zettabyte), easy to replicate, and, according to experts, could be stable for more than 100,000 years. To date, scientists have encoded all kinds of digital information into DNA sequences, including the novel “War and Peace”, the video of Deep Purple’s “Smoke on the Water”, and a GIF of a galloping horse. However, to replace existing silicon-chip or magnetic-tape technologies, DNA-based systems will have to become significantly less costly and more predictable to read, write, and package.
DNA is the data storage system selected by nature, chosen by evolution to carry the genetic information that encodes life. For this reason, scientists believe it could readily be used to store other types of data, such as video, audio, and text. Computers represent all stored data in binary code, using the digits 1 and 0. In a similar fashion, scientists could use the four nucleotide bases of DNA, cytosine, adenine, thymine, and guanine (C, A, T, G), as a code to store digital information.
Because DNA is far denser than other storage materials, a large amount of data could be stored in a microscopic molecule. Moreover, DNA is far more durable than magnetic tape and silicon chips, which deteriorate within a few decades. Storing digital data in DNA nucleotide sequences requires synthesizing DNA molecules and preserving them in a suitable environment, which scientists believe could be accomplished with existing technologies.
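To make the analogy between binary code and the four bases concrete, here is a minimal sketch of one possible mapping, in which every pair of bits becomes one nucleotide. The particular assignment (00 to A, 01 to C, 10 to G, 11 to T) is an assumption chosen for illustration. Practical encoding schemes are more elaborate, for example to avoid sequences that are hard to synthesize or read.

```python
# Hypothetical mapping for illustration: each pair of bits becomes one nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Turn a nucleotide string produced by encode() back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"DNA")
print(strand)          # CACACATGCAAC
print(decode(strand))  # b'DNA'
```

With this mapping, each byte takes four nucleotides, so the three-letter word "DNA" becomes a 12-base strand.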
Based on New Technological Advances, DNA Data Storage Could Become a Reality
Over the last 3.8 billion years, nature has optimized DNA as the data storage system of life. DNA is an incredibly dense storage medium. The DNA in a single human cell encodes roughly 24 gigabits of information, as much data as an hour-long HD video stream, and the human body contains an estimated 37.2 trillion cells. With the advent of gene editing technologies such as the CRISPR-Cas system, it is now possible to write, edit, and store precise information in DNA molecules within living cells.
In 2017, the team of Dr. George M. Church at the Department of Genetics of Harvard University demonstrated what an excellent medium DNA is for data archival. Using the CRISPR-Cas gene editing system, the scientists encoded images and a short movie into the genomes of a population of living bacteria (Nature 2017, 547(7663), 345-349). The results demonstrated that digital data can be stored within the genomes of populations of living cells. Pixel values were stored as a nucleotide code within DNA molecules: a flexible 21-color code was used to encode five frames of Eadweard Muybridge’s “Horse in Motion” film at 36x26 pixels, with each frame represented by unique nucleotide sequences. Using this CRISPR-based technology, the authors believe it is possible to turn human cells such as neurons into data storage systems. In theory, cells could gather information about themselves and store it in their DNA for scientists to review later. Dr. Church’s team called this system a “molecular ticker tape”.
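A short sketch helps show why a 21-color code fits naturally into nucleotide triplets: two bases give only 16 combinations, while three give 64. The codebook, the pixel ordering, and the test frame below are assumptions made for illustration. They are not the codebook or layout used in the published study.

```python
from itertools import product

BASES = "ACGT"
NUM_COLORS = 21
WIDTH, HEIGHT = 36, 26  # frame size quoted above


def bases_per_pixel(num_colors: int) -> int:
    """Smallest codeword length n such that 4**n >= num_colors."""
    n = 1
    while 4**n < num_colors:
        n += 1
    return n


CODE_LEN = bases_per_pixel(NUM_COLORS)  # 3, since 4**2 = 16 < 21 <= 4**3 = 64

# Hypothetical codebook: the first 21 triplets in alphabetical order.
CODEBOOK = ["".join(t) for t in product(BASES, repeat=CODE_LEN)][:NUM_COLORS]
DECODE = {codon: color for color, codon in enumerate(CODEBOOK)}


def encode_frame(pixels: list) -> str:
    """Flatten a frame (rows of color indices) into one nucleotide string."""
    return "".join(CODEBOOK[color] for row in pixels for color in row)


def decode_frame(seq: str, width: int) -> list:
    """Rebuild the rows of color indices from a nucleotide string."""
    colors = [DECODE[seq[i:i + CODE_LEN]] for i in range(0, len(seq), CODE_LEN)]
    return [colors[i:i + width] for i in range(0, len(colors), width)]


# Synthetic test pattern standing in for a real frame.
frame = [[(x + y) % NUM_COLORS for x in range(WIDTH)] for y in range(HEIGHT)]
seq = encode_frame(frame)
print(CODE_LEN, len(seq))  # 3 bases per pixel, 36 * 26 * 3 = 2808 bases per frame
```

Decoding with `decode_frame(seq, WIDTH)` returns the original frame. In a living-cell system the sequence would also have to be split into short pieces that the cells can take up, which this sketch does not attempt.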
The CRISPR-Cas gene editing system was discovered in bacteria, which use it as a defense mechanism against viral infection. Bacteria use the CRISPR-Cas system to incorporate viral DNA sequences into their own genome, which allows them to later recognize and destroy that same virus. As bacteria encounter new viruses, they add the viral DNA sequences to their own DNA in the chronological order in which the viruses arrive. In theory, this allows for chronological data storage. Using this kind of living recording system, scientists hope to detect which neurons connect to each other in response to various stimuli. In recent years, scientists studying DNA have moved from the “read-only” genomic era to the “read-and-write” era. DNA-writing technologies have the potential to transform DNA into a dynamic tool for processing and storing data in living cells.
To read digital information stored in DNA, it is necessary to sequence the DNA molecule and decode it back into digital data. Sequencing an entire DNA molecule is laborious, time-consuming, and expensive. Scientists at Microsoft Research and the University of Washington, Seattle, have developed a random-access technology that minimizes the amount of sequencing required to recover digital data stored in DNA (published in Nature Biotechnology in March 2018). Rather than sequencing the entire DNA molecule, they can selectively access the section containing the desired information while sparing the rest. The technology uses “PCR-based random access” to retrieve only the pertinent information coded within the DNA.
In the future, industry experts believe DNA data storage would allow every movie ever made to be written into DNA, creating a library smaller than a sugar cube that would last for at least 10,000 years. For now, we know for certain that Martin Luther King’s “I Have a Dream” speech and Shakespeare’s sonnets have been recorded in DNA molecules (Nature 2013, 494, 77-80).
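The idea behind PCR-based random access can be sketched in software: every strand belonging to a file starts with the same short address sequence (a primer), so a reader can pull out only the strands with that address and ignore the rest of the pool. The primer sequences, the four-base chunk index, and the function names below are all hypothetical. Real systems use primer pairs and error-correcting codes that this sketch leaves out.

```python
BASES = "ACGT"
INDEX_LEN = 4  # bases reserved for the chunk index (up to 4**4 = 256 chunks)


def index_to_bases(i: int) -> str:
    """Write a chunk index as a fixed-length base-4 number using A, C, G, T."""
    digits = []
    for _ in range(INDEX_LEN):
        i, remainder = divmod(i, 4)
        digits.append(BASES[remainder])
    return "".join(reversed(digits))


def bases_to_index(s: str) -> int:
    """Inverse of index_to_bases()."""
    value = 0
    for base in s:
        value = value * 4 + BASES.index(base)
    return value


def write_file(pool: list, primer: str, chunks: list) -> None:
    """Add one strand per chunk: primer (address) + chunk index + payload."""
    for i, chunk in enumerate(chunks):
        pool.append(primer + index_to_bases(i) + chunk)


def read_file(pool: list, primer: str) -> list:
    """Mimic selective amplification: keep only strands carrying this primer."""
    selected = [s[len(primer):] for s in pool if s.startswith(primer)]
    selected.sort(key=lambda s: bases_to_index(s[:INDEX_LEN]))
    return [s[INDEX_LEN:] for s in selected]


pool = []
write_file(pool, "ACGTACGTAC", ["GGAT", "CCTA", "TTAG"])  # hypothetical file 1
write_file(pool, "TTGACCGATA", ["AAAA", "CGCG"])          # hypothetical file 2
pool.reverse()  # a real DNA pool has no physical order

print(read_file(pool, "TTGACCGATA"))  # ['AAAA', 'CGCG']
```

Only the strands tagged with the requested primer are examined, which is the software analogue of sequencing one section while sparing the rest.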
Relatively Higher Costs and Technological Challenges May Act as an Insurmountable Barrier
Both DNA synthesis and sequencing are inherently error-prone, with error rates of approximately 0.01 errors per base. This constitutes a natural barrier to using these methods to store data. It could also prevent the use of living cells as DNA storage systems, since relatively high mutation rates put the fidelity of the data in question. Some experts predict that the push to increase throughput and lower costs will raise error rates in DNA-based systems to levels much higher than those observed in today’s experimental prototypes. They believe that if error rates increase, it will be difficult to commercialize these technologies successfully.
Another obstacle currently hindering the development of DNA-based archiving systems is the large amount of synthetic DNA required to meet global demand for data storage. The total amount of synthetic DNA available for this purpose will have to increase significantly. Despite recent technological advances, it is still difficult to synthesize long sequences of DNA to an exactly specified design.
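A rough calculation shows why an error rate of 0.01 per base matters, and why reading many copies of the same strand helps. The sketch below assumes that errors are independent substitutions, that a strand is 150 bases long, and that each base is decided by a majority vote over five reads. All three are illustrative assumptions, not figures from the studies cited here.

```python
from math import comb


def prob_strand_error_free(error_rate: float, length: int) -> float:
    """Chance that every base in a strand of the given length is correct."""
    return (1 - error_rate) ** length


def majority_vote_error(error_rate: float, reads: int) -> float:
    """Upper bound on the chance that a majority of independent reads
    are wrong at one base position (use an odd number of reads)."""
    needed = reads // 2 + 1
    return sum(
        comb(reads, k) * error_rate**k * (1 - error_rate) ** (reads - k)
        for k in range(needed, reads + 1)
    )


ERROR_RATE = 0.01       # errors per base, as quoted above
ASSUMED_LENGTH = 150    # assumed strand length in bases
ASSUMED_READS = 5       # assumed number of reads per strand

print(f"Error-free strand: {prob_strand_error_free(ERROR_RATE, ASSUMED_LENGTH):.3f}")
print(f"Per-base error after voting: {majority_vote_error(ERROR_RATE, ASSUMED_READS):.2e}")
```

Under these assumptions only about one strand in five comes through without any error, while voting over five reads pushes the per-base error down to roughly one in a hundred thousand. That gain is paid for with extra sequencing.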
At present, DNA-based systems have a throughput of only a few kilobytes per second, which does not match existing storage technologies with throughputs of hundreds of megabits per second. Not all nucleotide sequences are equally effective for transferring digital data into a DNA molecule, which makes selecting the best data encoding method a challenge. Existing approaches to reading data stored in DNA rely on a high degree of sequencing redundancy (having many copies of DNA molecules for each sequence), which increases technological complexity.
Furthermore, existing technologies are much cheaper than DNA data storage at present. Companies such as Twist Bioscience, which specializes in DNA synthesis, charge $0.07-$0.09 per nucleotide. For scale, the human genome has 3 billion nucleotide base pairs. Given current prices, a single minute of high-quality stereo sound could be stored for roughly $100,000. This is too costly compared to existing technologies. Recovering digital data from DNA will also require more accurate, cheaper, and faster DNA sequencing technologies. Some scientists believe it will take too long before these technologies become mainstream.
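The synthesis prices quoted above can be turned into rough storage costs. The sketch below assumes an idealized density of 2 bits per nucleotide with no redundancy or addressing overhead, so its results are illustrative. The $100,000 audio estimate in the text presumably rests on its own assumptions about bitrate and encoding, which are not stated.

```python
PRICE_RANGE_USD_PER_NT = (0.07, 0.09)  # synthesis price range quoted above
HUMAN_GENOME_NT = 3_000_000_000        # genome length quoted above
ASSUMED_BITS_PER_NT = 2                # assumption: idealized coding density


def nucleotides_needed(num_bytes: int, bits_per_nt: int = ASSUMED_BITS_PER_NT) -> int:
    """Nucleotides required to hold num_bytes, rounded up."""
    total_bits = num_bytes * 8
    return -(-total_bits // bits_per_nt)  # ceiling division


def synthesis_cost_range(nucleotides: int) -> tuple:
    """(low, high) synthesis cost in USD for a given number of nucleotides."""
    low, high = PRICE_RANGE_USD_PER_NT
    return nucleotides * low, nucleotides * high


def bytes_for_budget(budget_usd: float, price_per_nt: float,
                     bits_per_nt: int = ASSUMED_BITS_PER_NT) -> int:
    """Whole bytes that a budget buys at a given price per nucleotide."""
    nucleotides = int(budget_usd / price_per_nt)
    return nucleotides * bits_per_nt // 8


low, high = synthesis_cost_range(HUMAN_GENOME_NT)
print(f"Genome-length synthesis: ${low:,.0f} to ${high:,.0f}")

low, high = synthesis_cost_range(nucleotides_needed(1_000_000))
print(f"One megabyte: ${low:,.0f} to ${high:,.0f}")

print(f"$100,000 buys about {bytes_for_budget(100_000, 0.07):,} bytes at $0.07")
```

Under these assumptions a single megabyte costs hundreds of thousands of dollars to synthesize, which illustrates the size of the gap with tape or disk.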
New Technologies, Random Access in Large-Scale DNA Data Storage
The collaboration between the Microsoft Research team and scientists from the Computer Science and Engineering Department of the University of Washington, Seattle, resulted in a large DNA library of high-definition video, images, audio, and text, including the “Universal Declaration of Human Rights” in over 100 languages and a high-definition music video by the band “OK Go”, among other data. This work was published in the journal Nature Biotechnology in March 2018. The authors believe DNA data storage has the potential to replace magnetic tape for information storage.
Nowadays, most digital data, from music to satellite images to research files, is saved on magnetic tape. Tape is relatively inexpensive, but it takes up space and requires replacement approximately every ten years. Furthermore, magnetic-tape storage requires active, ongoing maintenance and regular migration between storage media, whereas DNA-based systems require no active maintenance other than a cold, dry, and dark environment. A large DNA data storage system or library would likely be kept in much the same way as the Global Crop Diversity Trust’s Svalbard Global Seed Vault (the world’s largest collection of crop diversity), which has no permanent on-site staff and whose contents are estimated to remain viable for thousands of years.
An emerging competitor in the field of DNA data storage is Catalog, Inc., a Massachusetts Institute of Technology (MIT) spinoff that aims to use DNA to store the world’s information. The privately held company is building a machine that uses 500 trillion molecules of DNA to write a terabyte of information per day. Catalog’s method can store 600 billion gigabytes in the same volume, which the company believes could be very useful for film studios and particle-physics laboratories, which require large storage capacities.
Catalog, Inc. has shown that its finished product resembles a thin, almost invisible film. To access the stored data, the DNA is mixed with water and put into a sequencer that reads the C, A, T, G sequences and translates them back into digital data. Catalog believes that replacing silicon with DNA as a data archiving solution may represent the future of the industry. To demonstrate the potential of its technology, Catalog’s team has stored Douglas Adams’ science fiction novel “The Hitchhiker’s Guide to the Galaxy” in a DNA-based data storage system. The startup is currently backed by venture investors including New Enterprise Associates, OS Fund, Day One Ventures, Data Collective, and Green Bay Ventures.
Microsoft (MSFT) has said it hopes to deliver a commercial DNA-based data archiving system by the end of this decade. Catalog hopes to beat the tech giant by delivering its own commercial DNA-based system this year. Catalog’s management expects to reduce the costs of DNA systems to the same levels as magnetic tape. The Company plans to execute a beta launch of its DNA-based system to the intelligence and space agencies within the Federal Government, as well as the IT sector and Hollywood/entertainment industry.
Another competitor in the space is Iridia, a privately-held company based in San Diego, which was founded in 2016. The Company’s management is also aiming to develop the first commercial DNA-based data recording system. Iridia’s scientists are combining DNA polymer synthesis technology, electronic nano-switches and semiconductor technologies to store data at significantly higher density. Smaller companies, such as Iridia and Catalog, will compete for market share with larger companies such as Microsoft, Intel (INTC) and Micron Technology, Inc (MU). Industry leaders believe that innovation in DNA synthesis, DNA sequencing methods, and gene editing technologies could accelerate the development of commercial DNA data storage systems.
“Random Access in Large-Scale DNA Data Storage”, Nature Biotechnology 2018, 36(3)p242-248.
Scientists Upload a Galloping Horse GIF into Bacteria with CRISPR, Megan Molteni, Wired, July 12, 2017
What are genome editing and CRISPR-Cas9?, U.S. National Library of Medicine
Nature 2017, 547(7663), 345-349.
“Emerging Applications for DNA Writers and Molecular Recorders”, Science 2018, 361(6405)p870-875.
“Towards practical, high-capacity, low-maintenance information storage in synthesized DNA”, Nature 2013, 494, 77-80
“Storing Data in DNA is a Lot Easier Than Getting it Back Out”, MIT Technology Review, January 26, 2018
High-definition video of the band “OK Go” stored in DNA (Nature Biotechnology 2018, 36(3)p242-248)
“The next big thing in data storage is actually microscopic”, Hiawatha Bray, Boston Globe, June 26, 2018