The following table shows some of the common types of data files and includes some information about them:
File Extension | Type of Data | Format | Example(s)
---|---|---|---
txt | multi-format | Text | study metadata, tab-delimited data
fastq | nucleotide | Text | sequencing reads
fasta | nucleotide, protein | Text | the human genome
sff | nucleotide | Binary | Roche/454 sequencing data
vcf | multi-format | Text | variation/SNP calls
sam | alignment | Text | reads aligned to a reference
bam | alignment | Binary | reads aligned to a reference
bed | metadata / feature definitions | Text | genome coverage
h5 | binary hierarchical | Binary | PacBio sequencing data
pileup | alignment | Text | mpileup, SNP and indel calling
In this workshop, there are a few bioinformatics-related data types we will focus on (beyond simple text files - although in principle many of the files are text). First let’s consider the definition/documentation for these file types:
Plain-text
Compressed/binary
In addition to understanding how to work with these files, we also need to understand how to verify their integrity. It is not uncommon for a download to produce error messages or need restarting, or for files to be moved around afterwards. In these cases, knowing how to verify that two files are the same (not simply named the same) is very important. To do this we use checksums:
According to Wikipedia, a checksum is ‘a small-size datum from a block of digital data for the purpose of detecting errors which may have been introduced during its transmission or storage.’ In other words, it is a short value computed from a file’s contents that, for practical purposes, identifies that file. Any change to the file (adding a space, deleting a character, etc.) will almost certainly change the file’s checksum.
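To make this concrete, here is a minimal sketch that checksums a small file, appends a single space, and checksums it again. The file name `example.fa` and its contents are made up for illustration, and the sketch assumes GNU `md5sum` is available.

```shell
# Work in a throwaway directory so no workshop files are touched.
workdir=$(mktemp -d)

# Hypothetical two-line FASTA file used only for this demonstration.
printf '>seq1\nACGTACGT\n' > "$workdir/example.fa"

# Checksum of the file as first written (read from standard input).
before=$(md5sum < "$workdir/example.fa")

# Append one space character to the end of the file.
printf ' ' >> "$workdir/example.fa"

# Checksum of the file after the one-character change.
after=$(md5sum < "$workdir/example.fa")

echo "before: $before"
echo "after:  $after"
```

The two printed values should differ even though only one character was added.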
Use wget to download one of the ‘test’ genomes from the following URL:
$ wget http://de.iplantcollaborative.org/dl/d/6E4E9943-93F8-4136-86E3-14DA6D1B604F/GCF_000017985.1_ASM1798v1_genomic_2.fna
Next, run md5sum on each downloaded file and compare the results:
$ md5sum <file>
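Comparing long checksums by eye is error-prone, so the comparison can also be scripted. The sketch below uses two small stand-in files (`download_1.fna` and `download_2.fna` are hypothetical names, not the workshop downloads) and assumes GNU `md5sum`.

```shell
# Throwaway directory holding two stand-in "downloads".
workdir=$(mktemp -d)
printf '>chr\nACGTACGTACGT\n' > "$workdir/download_1.fna"
cp "$workdir/download_1.fna" "$workdir/download_2.fna"

# md5sum prints "<checksum>  <filename>"; cut keeps only the checksum.
sum_1=$(md5sum "$workdir/download_1.fna" | cut -d ' ' -f 1)
sum_2=$(md5sum "$workdir/download_2.fna" | cut -d ' ' -f 1)

# Equal checksums indicate the two files have the same content.
if [ "$sum_1" = "$sum_2" ]; then
    comparison="identical"
else
    comparison="different"
fi
echo "download_1.fna and download_2.fna are $comparison"
```

To use this on real data, replace the two stand-in paths with the files you downloaded.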
Use the wget command to download the contents of the FTP site (don’t forget to use the ‘*’ wildcard to download all files):
$ wget ftp://ftp.ensemblgenomes.org/pub/bacteria/release-27/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/dna/*
You should have downloaded the following files:
CHECKSUMS
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.chromosome.Chromosome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.genome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_rm.chromosome.Chromosome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_rm.genome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_rm.toplevel.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_sm.chromosome.Chromosome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_sm.genome.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna_sm.toplevel.fa.gz
Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.toplevel.fa.gz
README
Use the less command to examine the README file - in particular, look at the
Generate a checksum using the sum command (sum is used by Ensembl and is an alternative to md5sum) for the ‘Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.genome.fa.gz’ file and compare it with the last few digits of the sum displayed in the CHECKSUMS file.
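The lookup can be scripted as well. The following is only a sketch: it assumes each line of CHECKSUMS holds the two numbers printed by `sum` followed by the file name, which you should confirm against your own CHECKSUMS file. The file name `example.dna.genome.fa.gz` and the mock CHECKSUMS file are invented so the example runs without a download.

```shell
workdir=$(mktemp -d)
target="example.dna.genome.fa.gz"   # hypothetical file name

# Stand-in for a downloaded, gzip-compressed genome.
printf '>Chromosome\nACGTACGT\n' | gzip > "$workdir/$target"

# sum prints a checksum and a block count; awk normalises the spacing.
computed=$(sum < "$workdir/$target" | awk '{print $1, $2}')

# Mock CHECKSUMS file in the ASSUMED layout: checksum, blocks, file name.
{
    printf '00000 1 %s\n' "some_other_file.fa.gz"
    printf '%s %s\n' "$computed" "$target"
} > "$workdir/CHECKSUMS"

# Pull out the expected values for our file from CHECKSUMS.
expected=$(awk -v name="$target" '$3 == name {print $1, $2}' "$workdir/CHECKSUMS")

if [ "$computed" = "$expected" ]; then
    result="match"
else
    result="MISMATCH"
fi
echo "$target: computed=[$computed] expected=[$expected] -> $result"
```

With real data, skip the mock-file step and point the `awk` lookup at the CHECKSUMS file you downloaded.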
Preview the first few lines (head) of the compressed (gzip’d) reference genome using the zcat command:
$ zcat Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.genome.fa.gz |head
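Because zcat writes the decompressed text to standard output, other tools can be chained onto it without unpacking the file on disk. As a small sketch on an invented file (`example.fa.gz` is a hypothetical name), the commands below preview the first line and count the FASTA header lines:

```shell
workdir=$(mktemp -d)

# Build a tiny compressed FASTA file to stand in for the reference genome.
printf '>Chromosome example\nACGT\nTTGA\n' > "$workdir/example.fa"
gzip "$workdir/example.fa"   # replaces example.fa with example.fa.gz

# Preview the first line without decompressing the file on disk.
first_line=$(zcat "$workdir/example.fa.gz" | head -n 1)

# Count sequences: every FASTA record starts with a '>' header line.
n_sequences=$(zcat "$workdir/example.fa.gz" | grep -c '^>')

echo "$first_line"
echo "sequences: $n_sequences"
```

The compressed file is left untouched after both commands.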
Finally, unzip the E. coli reference genome, which will be used in the variant calling lesson, and place it in a new ‘ref_genome’ folder inside your ‘data’ folder.
$ gzip -d Escherichia_coli_b_str_rel606.GCA_000017985.1.27.dna.genome.fa.gz
Tip: create the ‘ref_genome’ folder in ‘~/dc_workshop/data’ and use the cp command to copy the data into it.
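One way the tip could look in practice is sketched below. To keep the example safe to run anywhere, it uses a temporary directory as a stand-in for ‘~/dc_workshop’ and an invented file `example.fa.gz` in place of the real genome; substitute your own paths and file name.

```shell
# Stand-in for ~/dc_workshop so the sketch does not touch real data.
workshop_dir="$(mktemp -d)/dc_workshop"

# -p creates the data and ref_genome folders in one step if missing.
mkdir -p "$workshop_dir/data/ref_genome"

# Hypothetical compressed genome sitting in the download location.
printf '>Chromosome\nACGT\n' > "$workshop_dir/example.fa"
gzip "$workshop_dir/example.fa"

# Decompress in place: example.fa.gz becomes example.fa.
gzip -d "$workshop_dir/example.fa.gz"

# Copy the uncompressed genome into the new ref_genome folder.
cp "$workshop_dir/example.fa" "$workshop_dir/data/ref_genome/"

ls "$workshop_dir/data/ref_genome"
```

Note that cp leaves the original file where it was; use mv instead if you do not want to keep two copies.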