Korean Genome Analysis Outline
Personal Genome Analysis Outline
This project used Solexa's next-generation sequencer. The sequencer reads DNA that has been broken into up to billions of short pieces. The sequences read from these fragments are fed into computers to produce an assembled or aligned genomic sequence. We produced 640 million fragments with a total length of 22.4 billion base pairs. This is about 8 times the size of one human genome, which is about 3 billion bases, and it takes up around 23 gigabytes of disk space. The redundancy is needed because modern sequencers are not perfect and cannot read a whole genome in one pass. It is therefore necessary to produce redundant DNA sequences, align them, and select the high-quality regions.
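The arithmetic behind these figures can be checked with a short sketch. It assumes that every fragment is 35 bases long (the fragment length quoted in the alignment description of this article) and uses the rounded 3-billion-base genome size; the function names are illustrative only.

```python
# Figures quoted in the text. READ_LENGTH is an assumption: the article only
# mentions 35-base fragments when describing the alignment threshold.
NUM_READS = 640_000_000
READ_LENGTH = 35
GENOME_SIZE = 3_000_000_000  # "about 3 billion bases"


def total_bases(num_reads: int, read_length: int) -> int:
    """Total number of sequenced bases for fixed-length reads."""
    return num_reads * read_length


def fold_coverage(bases: int, genome_size: int) -> float:
    """Average number of times each genome position is sequenced."""
    return bases / genome_size


if __name__ == "__main__":
    bases = total_bases(NUM_READS, READ_LENGTH)
    print(f"total bases:   {bases:,}")
    print(f"fold coverage: {fold_coverage(bases, GENOME_SIZE):.2f}x")
```

Under these assumptions the product is 22.4 billion bases, which matches the quoted total, and the fold coverage comes out slightly below the rounded "about 8 times".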
<Aligned DNA fragments from the sequencers>
KOBIC was responsible for the analysis in the Korean Personal Genome Project. KOBIC maintains very large-scale computing resources to process personal genome information. The raw DNA sequence data from the sequencers are processed by various bioinformatics programs to generate information on sequencing accuracy, comparative genomic data, personal variation, and disease-associated SNPs. For comparison, we used the published Chinese personal genome and two Caucasian genome sequences.
|Korean genome sequence analysis results|
|Total DNA base pairs sequenced||
Verified DNA bases
Coverage compared to NCBI reference genome
Large-scale personal genome analysis
NCBI in the USA provides the reference genome, which is based on the first Caucasian genome project, completed in 2003. Comparing NCBI's reference genome with the KSJ Korean genome, we found that 20,700,000,000 of 22,409,000,000 bases matched, and the reported identity is about 98.35%. In the comparison, a 35-base fragment was counted as a match if it had at most two non-matching bases. As a further criterion, a DNA region was accepted only if four or more fragments matched it, and regions with more than 50 matching fragments were removed, because too many matching fragments indicate a repeat region. After this quality check, we discovered 3.23 million SNP candidates. This is about 0.1% of the human genome. In other words, one person can have around 3.23 million variations. This number reflects the most fundamental difference among people. Over time, the number of distinct variations will change as more common and novel variations are catalogued. For the Chinese genome, around 3 million SNP candidates were reported. We discovered more than 3 million SNP candidates because Korean genomic and SNP information had not previously been deposited on a large scale.
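The matching and depth criteria described above can be written down as a minimal sketch. This is not the pipeline KOBIC used; the function names are hypothetical, and the sketch only encodes the three thresholds stated in the text (at most two mismatches per 35-base fragment, at least four fragments per region, no more than 50 fragments per region).

```python
# Thresholds taken from the text; everything else is illustrative.
READ_LENGTH = 35
MAX_MISMATCHES = 2   # up to two non-matching bases per fragment
MIN_DEPTH = 4        # four or more fragments must cover a region
MAX_DEPTH = 50       # more than 50 fragments suggests a repeat region


def count_mismatches(read: str, reference_window: str) -> int:
    """Count positions where the read differs from the reference window."""
    if len(read) != len(reference_window):
        raise ValueError("read and reference window must have equal length")
    return sum(1 for a, b in zip(read, reference_window) if a != b)


def read_is_accepted(read: str, reference_window: str) -> bool:
    """A fragment matches if it has at most MAX_MISMATCHES mismatches."""
    return count_mismatches(read, reference_window) <= MAX_MISMATCHES


def region_is_usable(depth: int) -> bool:
    """Keep regions covered by 4 to 50 fragments (inclusive)."""
    return MIN_DEPTH <= depth <= MAX_DEPTH


if __name__ == "__main__":
    reference = ("ACGT" * 9)[:READ_LENGTH]
    read = "TT" + reference[2:]  # differs from the reference at two positions
    print(read_is_accepted(read, reference))
    print([region_is_usable(d) for d in (3, 4, 50, 51)])
```

A real aligner would also handle insertions, deletions, and base quality scores, which this sketch deliberately leaves out.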
In the cross-comparison with the two Caucasian genomes and the Chinese genome, the Korean genome appears as an outlier because of differences in previous SNP registration. Chinese SNPs have been deposited through the international HapMap project, whereas Koreans did not participate in it. The KSJ genome shared 1.15 million SNPs with the Chinese genome, and over 2 million SNPs were Korean-specific, although Chinese and Koreans are ethnically very similar. Compared with James Watson's genome, the KSJ genome shared 720,000 SNPs and had 2.5 million Korean-specific ones. With Craig Venter's genome, it shared 920,000 SNPs and had 2.3 million Korean-specific ones. All four genomes shared 570,000 SNPs, and around 1.8 million SNPs were Korean-specific in this comparison. Among these, we found 1.5 million new SNP candidates that are not in the common SNP databases. This is about 0.06% of the whole human genome. This indicates that even when ethnically similar genomes are available, it is necessary to sequence ethnicity-specific genomes to discover further new SNPs.
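The shared and genome-specific counts in this comparison behave like set operations: for each pairwise comparison, the shared and Korean-specific numbers add up to roughly the 3.23 million KSJ candidates. The sketch below illustrates the idea on made-up toy data. The SNP tuples, the genome contents, and the function names are all hypothetical, and representing a SNP as (chromosome, position, allele) is an assumption rather than the format used in the project.

```python
from typing import Iterable, Set, Tuple

# Hypothetical SNP key: (chromosome, position, variant allele).
Snp = Tuple[str, int, str]


def shared_and_specific(target: Set[Snp], other: Set[Snp]) -> Tuple[Set[Snp], Set[Snp]]:
    """Return (SNPs in both genomes, SNPs found only in the target)."""
    return target & other, target - other


def shared_by_all(genomes: Iterable[Set[Snp]]) -> Set[Snp]:
    """SNPs present in every genome."""
    genome_list = list(genomes)
    return set.intersection(*genome_list) if genome_list else set()


def specific_to(target: Set[Snp], others: Iterable[Set[Snp]]) -> Set[Snp]:
    """SNPs in the target that appear in none of the other genomes."""
    return target - set().union(*others)


if __name__ == "__main__":
    # Toy data only; these are not real variants.
    ksj = {("chr1", 100, "A"), ("chr1", 200, "G"), ("chr2", 50, "T"), ("chr3", 7, "C")}
    chinese = {("chr1", 100, "A"), ("chr2", 50, "T")}
    watson = {("chr1", 100, "A"), ("chr9", 9, "G")}
    venter = {("chr1", 100, "A"), ("chr1", 200, "G")}

    shared, specific = shared_and_specific(ksj, chinese)
    print(len(shared), len(specific))
    print(shared_by_all([ksj, chinese, watson, venter]))
    print(specific_to(ksj, [chinese, watson, venter]))
```

Finding candidates that are absent from a public SNP database would be one more set difference against the database's entries.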
<Common SNPs between personal genomes>