Overview

Access to high-quality human genomics data can be a major barrier for bioinformatics learners, especially for those looking to practice aligning raw sequencing data. I've heard from a few different colleagues trying to practice with real-world human data, so I wanted to write a quick post plugging the 1000 genomes project as a source of practice data. In this guide, I'll walk you through how to procure 1k genoems files, use them to generate an aligned BAM, and view the output in the Integrated Genomics Viewer (IGV).

1000 Genomes

There are have been many publications from the 1k genomes project over the past decade, but I'll focus primarily on this paper, from 2015, which describes the phase 3 data release:

A global reference for human genetic variation
Nature - The 1000 Genomes Project has sought to comprehensively catalogue human genetic variation across populations, providing a valuable public genomic resource. The data obtained so far have...
https://www.nature.com/articles/nature15393

This release includes 2,504 genomes sequenced using low-coverage whole genome sequencing and targeted exome sequencing. The authors also provide access to the raw FASTQ files on the project portal. There is a great deal that you can do with this sort of data, but I wanted to focus on one of the simplest bioinformatics use cases first: generating an aligned BAM from a FASTQ file. To begin, we can first navigate to the project portal at https://www.internationalgenome.org/data-portal/data-collection/phase-3 and select any of the available samples, such as HG00152, to view the associated files. We can also scroll down the page to view the entire list of phase 3 .sra (Sequence Read Archive) files. Any of the .sra files are suitable for practicing basic alignments, we just need to download a paired set of FASTQ files.

Data portal | 1000 Genomes
The International Genome Sample Resource (IGSR) has been established at EMBL-EBI to continue supporting data generated by the 1000 Genomes Project, supplemented with new data and new analysis.
https://www.internationalgenome.org/data-portal/data-collection/phase-3

Using individual HG00152 as an example, we have 3 pairs of WGS FASTQ files to select from. Each pair is formatted with the naming format {accession}_1.fastq.gz and {accession}_2.fastq.gz representing the pairs. Why are there multiple pairs to choose from? From the FASTQ section of the 1000 genomes page:

"When an individual has many files with different run accessions, this means it was sequenced multiple times. This can either be for the same experiment, some centres used multiplexing to have better control over their coverage levels for the low coverage sequencing, or because it was sequenced using different protocols or on different platforms."

Any accession number should work for generating an alignment, so let's pick the ERR015526 files to download. Any learners wanting more metadata on this particular sample can check out the "20130502.phase3.sequence.index" file on the 1000 genomes FTP site (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/).

Workspace Setup

Once you've downloaded the files, we can proceed with the alignment. We'll be using BWA for the alignment (github) but I don't really feel like installing it locally (as a Mac user, it's somewhat common for me to run into issues running bioinformatics software that was developed for linux users). Instead, I'll grab a suitable Docker image from Dockerhub. The BWA Docker image maintained by the State Public Health Bioinformatics Workgroup on Dockerhub looks appropriate to me. We can obtain the image a quick docker pull staphb/bwa:latest (users with a Mac M1 may need to add --platform linux/amd64 to the pull command). We'll also need to procure a reference genome sequence to align against. I'll use the GRCh38 reference genome primary assembly from Ensembl, which I like because it represents a relatively lightweight reference sequence focused on the protein-coding regions of the genome. We can run a simple wget command to download this sequence from the Ensembl FTP site:

wget https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Now that we've procured our FASTQ files, the Docker image, and our reference genome, we can proceed with running the Docker image and generating the alignment!

The Alignment

The first step is to start our BWA Docker container.

# Run docker container
docker run \
    --rm \
    -v /path/to/local/data:/mnt \
    staphb/bwa

We've added two additional parameters to this Docker command:

  • --rm removes the container once you exit it

  • -v mounts a local directory (substitute your own path) where we can stage our files

Once you step into the Docker container, you can navigate to the mounted directory (cd /mnt). There are several algorithmic alignments that the BWA library can run, but the standard for short read sequencing is the maximal exact matches (MEM) approach designed for Illumina sequence reads up to 100bp. Below is an outline of the basic commands we need to run:

# Index FASTA file
bwa index Homo_sapiens.GRCh38.dna.primary_assembly.fa

# Run BWA MEM alignment
bwa mem \
  /mnt/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  /mnt/ERR015526_1.fastq.gz \
  /mnt/ERR015526_2.fastq.gz \
  > aligned_output.sam

If we want to run everything from outside of our Docker container, we can also chain the run command and subsequent BWA calls into a single command from outside of the Docker container:

docker run --rm \
  -v /path/to/local/data:/mnt \
  staphb/bwa \
  bwa index /mnt/Homo_sapiens.GRCh38.dna.primary_assembly.fa && \
  bwa mem \
  /mnt/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  /mnt/ERR015526_1.fastq.gz \
  /mnt/ERR015526_2.fastq.gz > /mnt/aligned_output.sam

On my Macbook M1, this alignment takes a very long time (several hours). Running Docker containers typically are allocated about half of the available system resources, so there are plenty of options for accelerating the BWA alignment, especially if users want to run the alignment on a GPU-capable system. However, I'll leave that discussion for another post.

The last thing we need to do is convert our SAM file into a BAM, which we actually can't do using the BWA docker container (it exclusively contains a BWA installation, no other bioinformatics libraries). There are many different ways to install samtools (Mac users can brew install it, and samtools is also available on the Anaconda cloud, as just a few options). Users wishing to run a samtools Docker image like we did for BWA can procure a samtools image from Dockerhub. There are many suitable images, but I would opt to use the samtools biocontainers image.

Once we have a working copy of samtools, we'll need to sort and index the BAM file to use it in any meaningful way. We can just do all 3 operations at once with a succinct set of pipes:


samtools view -bS aligned_output.sam | samtools sort -o ERR015526.bam && samtools index ERR015526.bam 

This process takes a few minutes on my terminal, and results in compressed BAM file we can explore with IGV.

Using the BAM File

There are many ways to use the generated alignment information, many of which are outside the scope of our post. But as an initial test of whether the BAM alignment looks "normal" we can view some of the aligned regions in the Integrated Genomics Viewer. What to view? We can start by examining a classic locus involved in cancer onset and progression, the BRCA1 gene. This gene is on chromosome 17 and spans the nucleotide positions 43,044,292 - 43,170,327, but we can examine exon 19 specifically. This region is critical for normal BRCA1 function, and several confirmed pathogenic BRCA1 mutations map to this exon. One example, from ClinVar, is a Phe>Ser amino acid change at hg38 genomic position chr17:43051113. If we navigate to this position in IGV, we can see that subject HG00152 is completely wild type for this exon (with the caveat that overall coverage of this regions is relatively low - a clinical assay would have much more targeted coverage of this area).

These results are expected: BRCA1 mutations are generally rare, and the 1k genomes project was intended to sequence healthy individuals for population genomics uses.

Conclusion

That concludes the discussion of the 1,000 genomes project! I hope this has been a useful exercise which has demystified a key source of population genomics data over the last decade. There are so many other things you can do with this data which are outside of the scope of this post, but I hope that this will serve as a good starting point for anyone hoping to acclimate themselves to human genomics data.