This is a DRAFT. Please send comments to #data-coord on the VGP Slack workspace.

Data Structure

All data for Genomeark is stored in an AWS S3 bucket. The data can be browsed using a web browser here. This page describes how the data are organized. Note that AWS S3 is “object storage” and does not contain hierarchical directories or folders like traditional file systems. However, the structure will be discussed as if the data will be organized in directories for convenience. The following is the current file tree structure:

/
└── species/
    └── {Genus_species}/
        └── {ToLID}/
            ├── assembly_{pipeline}_{ver}/
            ├── assembly_curated/
            ├── assembly_MT_{institution}/
            ├── genomic_data/
            │   ├── arima/
            │   ├── bionano/
            │   ├── dovetail/
            │   ├── element/
            │   ├── illumina/
            │   ├── ont_duplex/
            │   ├── ont/
            │   ├── pacbio_hifi/
            │   ├── pacbio/
            │   └── 10x/
            └── transcriptomic_data/
                └── {tissue_id}/
                    ├── illumina/
                    └── pacbio/

Once we visit a project (i.e., ToLID directory), the top-level directory is expected to have:

genomic_data
transcriptomic_data
assembly_{pipeline}_{ver}
assembly_curated
assembly_MT_{institution}

Genomic Data

CCS/HiFi data from Pacific Biosciences

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── genomic_data/
                  └── pacbio_hifi/
                      ├── {movie}.subreads.bam      # optional, but recommended
                      ├── {movie}.subreads.bam.pbi  # optional, but recommended
                      ├── {movie}.hifi_reads.bam
                      ├── {movie}.hifi_reads.fastq.gz
                      ├── {movie}.hifi_reads.5mC.bam
                      ├── {movie}.deepconsensus.bam
                      ├── README
                      └── files.md5

Please see these notes about kinetics and methylation tags in the hifi_reads.bam file.

Many often provide fastq (gzipped) in addition to the BAMs, despite how wasteful that is on space, for convenience — especially during the analysis phase of the project. Keeping the subreads BAM files is helpful for calling bases with DeepConsensus later, especially if it hasn’t already been done.

Hi-C from Arima Genomics

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── genomic_data/
                  └── arima/
                      ├── {prefix}_{runID}_R1.fastq.gz
                      ├── {prefix}_{runID}_R2.fastq.gz
                      ├── re_bases.txt
                      ├── README
                      └── files.md5

Unmapped BAM/CRAM files can be provided instead of fastq. re_bases.txt is a legacy file with the restriction enzyme cutting sites that the kit used. The current VGP pipeline instead asks for the kit and not the enzyme sites, but some older datasets will have this file.

dovetail (Hi-C and/or Omni-C from Dovetail Genomics)

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── genomic_data/
                  └── dovetail/
                      ├── {prefix}_{runID}_R1.fastq.gz
                      ├── {prefix}_{runID}_R2.fastq.gz
                      ├── re_bases.txt
                      ├── README
                      └── files.md5

Whole Genome Shotgun from Illumina

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── genomic_data/
                  └── illumina/
                      ├── {prefix}_{runID}_R1.fastq.gz
                      ├── {prefix}_{runID}_R2.fastq.gz
                      ├── README
                      └── files.md5

Unmapped BAM/CRAM files can be provided instead of fastq.

Legacy Genomic Data

Older projects can also include the following legacy data.

Irys/Saphyr optical mapping from BioNano Genomics

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── genomic_data/
                  └── bionano/
                      ├── {prefix}_{platform}_{enzyme}[_{jobid}].bnx.gz
                      ├── {prefix}_{platform}_{enzyme}.cmap.gz
                      ├── README
                      └── files.md5

Simplex and/or duplex with any pore or chemistry from Oxford Nanopore Technologies

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── genomic_data/
                  ├── ont/
                  │   ├── {flowcell_id}_{basecaller}-{version}.bam
                  │   ├── {flowcell_id}_{basecaller}-{version}.fastq.gz
                  │   ├── README
                  │   ├── files.md5
                  │   └── fast5/     (optional, but recommended)
                  │       └── {flowcell_id}
                  │           ├── {prefix}.fast5
                  │           └── files.md5
                  └── ont_duplex/
                      ├── {flowcell_id}_{basecaller}-{version}.bam
                      ├── {flowcell_id}_{basecaller}-{version}.fastq.gz
                      ├── README
                      ├── files.md5
                      └── fast5/     (optional, but recommended)
                          └── {flowcell_id}
                              ├── {prefix}.fast5
                              └── files.md5

ONT data can be tricky to keep organized because there is such variation between runs, depending on choice of machine, pore, library prep, chemistry, etc. Most of this information will just be metadata that you should include in a README. You are encouraged to call methylation too, please note in the README what is and isn’t available in each file. Keeping the fast5s is also encouraged for future re-calling of bases and/or methylation.

Linked-reads from 10X Genomics

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── genomic_data/
                  └── 10x/
                      ├── {runID}_S1_L001_I1_001.fastq.gz
                      ├── {runID}_S1_L001_R1_001.fastq.gz
                      ├── {runID}_S1_L001_R2_001.fastq.gz
                      ├── README
                      └── files.md5

CLR data from Pacific Biosciences

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── genomic_data/
                  └── pacbio/
                      ├── {movie}.subreads.bam 
                      ├── {movie}.subreads.bam.pbi
                      ├── {movie}.scraps.bam
                      ├── README
                      └── files.md5

The ‘scraps’ have been purged from GenomeArk.

Transcriptomic Data

RNA-seq from Illumina

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── transcriptomic_data/
                  └── illumina/
                      └── {tissue_type}
                          ├── {prefix}_{runID}_R1.fastq.gz
                          ├── {prefix}_{runID}_R2.fastq.gz
                          ├── README
                          └── files.md5

ISO-seq from Pacific Biosciences

Projects use one of the two structures below:

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── transcriptomic_data/
                  └── pacbio/
                      └── {tissue_type}
                          ├── {ToLID}_{tissue_type}_flnc.bam
                          ├── {ToLID}_{tissue_type}_flnc.report.csv
                          ├── {ToLID}_{tissue_type}_hq_transcripts.fasta
                          ├── README
                          └── files.md5

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── transcriptomic_data/
                  └── pacbio/
                      └── {tissue_type}
                          ├── {movie}.subreads.bam
                          ├── {movie}.subreads.bam.pbi
                          ├── {movie}.subreadset.xml
                          ├── {prefix}_hq_isoforms_{#}.{#}.bam    # processed iso-seq file
                          ├── README
                          └── files.md5

Assemblies

Final Curated Assembly

Hi-C

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── assembly_curated/
                  ├── {genome_id}.hap1.cur.YYYYMMDD.chromosomes.csv        Chromosome .csv (haplotype 1)
                  ├── {genome_id}.hap1.cur.YYYYMMDD.fasta.gz               Final curated assembly (haplotype 1)
                  ├── {genome_id}.hap1.cur.YYYYMMDD.pretext                Pretext map (haplotype 1)
                  ├── {genome_id}.hap2.cur.YYYYMMDD.chromosomes.csv        Chromosome .csv (haplotype 2)
                  ├── {genome_id}.hap2.cur.YYYYMMDD.fasta.gz               Final curated assembly (haplotype 2)
                  ├── {genome_id}.hap2.cur.YYYYMMDD.pretext                Pretext map (haplotype 2)

Primary and alternate

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── assembly_curated/
                  ├── {genome_id}.alt.cur.YYYYMMDD.fasta.gz                Final curated assembly (alternate)
                  ├── {genome_id}.pri.cur.YYYYMMDD.chromosomes.csv         Chromosome .csv (primary)
                  ├── {genome_id}.pri.cur.YYYYMMDD.fasta.gz                Final curated assembly (primary)
                  ├── {genome_id}.pri.cur.YYYYMMDD.pretext                 Pretext map (primary)
                  └── intermediates/                                       Can contain files generated from draft assemblies to be used in curation        

It is acceptable to name the directory assembly_curated_{suffix}, where the suffix is an underscore followed by some informative string. This can be useful when more than 1 assemblies were curated so that they can be easily easily differentiated from each other. Examples of informative strings are the location or institution that generated the assembly, a version, or a date.

The following specifications for assembly folders apply to the VGP pipeline version 2.0 assemblies. For more information about version 1.0 assemblies, please see the previous documentation.

Uncurated Assemblies

vgp_standard_2.0 (primary/alternate)

  └── species
    └── {Genus_species}
        └── {ToLID}
            └── assembly_{pipeline}_{ver}
                ├── evaluation
                │   ├── busco
                │   │   ├── c
                │   │   │   └── {genome_id}_busco_[full_table.tab,missing_buscos.tab,busco_image.png,short_summary.txt]
                │   │   └── s1
                │   │       └── {genome_id}_busco_[full_table.tab,missing_buscos.tab,busco_image.png,short_summary.txt]
                │   ├── genomescope
                │   │   ├── {genome_id}_genomescope__[Linear,Log,Transformed_Linear,Transformed_Log]_Plot.png
                │   │   ├── {genome_id}_genomescope__[Model,Summary].txt
                │   │   └── {genome_id}_genomescope__Model_parameters.tsv
                │   ├── gfastats
                │   │   ├── c
                │   │   │   ├── {genome_id}_alt.tab
                │   │   │   └── {genome_id}_prim.tab
                │   │   ├── s1
                │   │   │   └── {genome_id}_.tab
                │   │   └── s2
                │   │       └── {genome_id}.tab
                │   ├── merqury
                │   │   ├── {genome_id}_png
                │   │   │   ├── output_merqury.assembly_[01,02].spectra-cn.[fl,ln,st].png
                │   │   │   ├── output_merqury.spectra-asm.[fl,ln,st].png
                │   │   │   └── output_merqury.spectra-cn.[fl,ln,st].png
                │   │   ├── {genome_id}_qv
                │   │   │   ├── otput_merqury.assembly_[01,02].tabular
                │   │   │   └── otput_merqury.tabular
                │   │   └── {genome_id}_stats
                │   │       └── output_merqury.completeness.tabular
                │   └── pretext
                │       ├── {genome_id}__s1.bed
                │       ├── {genome_id}__s1.heatmap.[png,pretext]
                │       ├── {genome_id}__s2.bed
                │       └── {genome_id}__s2.heatmap.[png,pretext]
                ├── intermediates
                │   ├── bionano
                │   │   ├── {genome_id}_NGS_contigs_not_scaffolded_NCBI_trimmed.fasta.gz
                │   │   ├── {genome_id}_NGS_contigs_scaffold_NCBI_trimmed.fasta.gz
                │   │   ├── {genome_id}_conflicts.txt
                │   │   ├── {genome_id}_hybrid_scaffold_report.txt
                │   │   └── {genome_id}_s1_AGP.agp
                │   ├── hifiasm
                │   │   ├── {genome_id}_alternate_assembly_contig_graph.gfa.gz
                │   │   ├── {genome_id}_haplotype_resolved_processed_unitig_graph.gfa.gz
                │   │   ├── {genome_id}_haplotype_resolved_raw_unitig_graph.gfa.gz
                │   │   └── {genome_id}_primary_assembly_contig_graph.gfa.gz
                │   ├── {genome_id}_c1.fasta.gz
                │   ├── {genome_id}_c2.fasta.gz
                │   ├── {genome_id}_p1.fasta.gz
                │   ├── {genome_id}_q2.fasta.gz
                │   ├── {genome_id}_s1.fasta.gz
                │   ├── meryl
                │   │   └── {genome_id}_.meryldb.tar.gz
                │   └── yahs
                │       ├── {genome_id}_{genome_id}_s2.agp
                │       └── {genome_id}_{genome_id}_s2.l og
                ├── {genome_id}.standard.alt.YYYYMMDD.fasta.gz
                ├── {genome_id}.standard.pri.YYYYMMDD.fasta.gz
                └── {genome_id}.yml

vgp_HiC_2.0 (hap1/hap2)

  └── species
    └── {Genus_species}
        └── {ToLID}
            └── assembly_vgp_HiC_2.0
                ├── {genome_id}.HiC.hap1.YYYYMMDD.fasta.gz                              Scaffolded draft assembly of hap1 that goes to curation
                ├── {genome_id}.HiC.hap2.YYYYMMDD.fasta.gz                              Scaffolded draft assembly of hap2 that goes to curation
                ├── {genome_id}.yml                                                     Assembly metadata file that is used for curation submission to Sanger
                ├── evaluation
                │   ├── busco
                │   │   └── c
                │   │       └── {genome_id}_HiC__busco_[hap1/hap2]_[full_table.tab,missing_buscos.tab,busco_image.png,short_summary.txt]
                │   ├── genomescope                                                     Folder has same content as described in the primary/alternate guide
                │   ├── gfastats
                │   │   └── c
                │   │       ├── {genome_id}__hap1.tab
                │   │       └── {genome_id}__hap2.tab
                │   ├── hap1
                │   │   ├── busco
                │   │   │   └── s1
                │   │   │       └── {genome_id}__[full_table.tab,missing_buscos.tab,busco_image.png,short_summary.txt]
                │   │   ├── gfastats
                │   │   │   ├── s1
                │   │   │   │   └── {genome_id}.tab
                │   │   │   └── s2
                │   │   │       └── {genome_id}.tab
                │   │   └── pretext
                │   │       ├── {genome_id}_hap1__[s1,s2].bed
                            └── {genome_id}_hap1__[s1,s2]_heatmap.[png,pretext]
                │   ├── hap2
                │   │   ├── busco
                │   │   │   └── s1
                │   │   │       └── {genome_id}__[full_table.tab,missing_buscos.tab,busco_image.png,short_summary.txt]
                │   │   ├── gfastats
                │   │   │   ├── s1
                │   │   │   │   └── {genome_id}.tab
                │   │   │   └── s2
                │   │   │       └── {genome_id}.tab
                │   │   └── pretext
                │   │       ├── {genome_id}_hap2__[s1,s2].bed
                            └── {genome_id}_hap2__[s1,s2]_heatmap.[png,pretext]
                │   ├── merqury
                │   │   ├── {genome_id}_png                                             Folder has same content as described in the primary/alternate guide
                │   │   ├── {genome_id}_qv                                              Folder has same content as described in the primary/alternate guide
                │   │   └── {genome_id}_stats                                           Folder has same content as described in the primary/alternate guide
                └── intermediates
                    ├── hap1
                    │   ├── bionano
                    │   │   ├── {genome_id}_NGS_contigs_scaffold_NCBI_trimmed.fasta.gz
                    │   │   ├── {genome_id}_NGS_contigs_not_scaffolded_NCBI_trimmed.fasta.gz
                    │   │   ├── {genome_id}_conflicts.txt
                    │   │   ├── {genome_id}_hybrid_scaffold_report.txt
                    │   │   └── {genome_id}_s1_AGP.agp
                    │   ├── salsa
                    │   │   └── {genome_id}_{genome_id}_s2.agp
                    │   └── {genome_id}_s1.fasta.gz                                     Hap1 bionano scaffolds and unscaffolded contigs
                    ├── hap2
                    │   ├── bionano
                    │   │   ├── {genome_id}_NGS_contigs_scaffold_NCBI_trimmed.fasta.gz
                    │   │   ├── {genome_id}_NGS_contigs_not_scaffolded_NCBI_trimmed.fasta.gz
                    │   │   ├── {genome_id}_conflicts.txt
                    │   │   ├── {genome_id}_hybrid_scaffold_report.txt
                    │   │   └── {genome_id}_s1_AGP.agp
                    │   ├── salsa
                    │   │   └── {genome_id}_{genome_id}_s2.agp
                    │   └── {genome_id}_s1.fasta.gz                                     Hap2 bionano scaffolds and unscaffolded contigs
                    ├── hifiasm
                    │   ├── {genome_id}__hifiasm.log
                    │   ├── {genome_id}__raw_unitig.gfa.gz
                    │   ├── {genome_id}_hap1_contig_graph.gfa.gz
                    │   └── {genome_id}_hap2_contig_graph.gfa.gz
                    ├── meryl
                    │   └── {genome_id}_.meryldb.tar.gz
                    ├── {genome_id}_hap1_c.fasta.gz                                     Hap1 contigs
                    └── {genome_id}_hap2_c.fasta.gz                                     Hap2 contigs

assembly_vgp_standard_1.0

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── assembly_{pipeline}_{ver}/     (pipeline: vgp_standard, vgp_HiC, vgp_trio, cambridge, ...)
                  ├── intermediates/
                  │   ├── falcon_unzip/                            FALCON unzip intermediate files
                  │   ├── purge_haplotigs/                         purge_haplotigs intermediate files
                  │   ├── scaff10x/                                Scaff10X intermediate files
                  │   ├── bionano/                                 Bionano TGH intermediate files
                  │   ├── salsa/                                   Salsa intermediate files
                  │   ├── arrow/                                   Arrow polishing intermediate files
                  │   ├── longer_freebayes_round1/                 Longranger freebayes polishing intermediate files (round1)
                  │   ├── longer_freebayes_round2/                 Longranger freebayes polishing intermediate files (round2)
                  │   ├── {genome_id}_c1.fasta.gz                  Pacbio FALCON-Unzip assembly primary contigs (haplotype 1)
                  │   ├── {genome_id}_c2.fasta.gz                  Pacbio FALCON-Unzip assembly associated haplotigs (haplotype 2)
                  │   ├── {genome_id}_p1.fasta.gz                  purge_haplotigs curated primary assembly (taking c1 as input)
                  │   ├── {genome_id}_p2.fasta.gz                  purge_haplotigs curated haplotigs (purged out from c1)
                  │   ├── {genome_id}_q2.fasta.gz                  c2 + q2 for future polishing
                  │   ├── {genome_id}_s1.fasta.gz                  2-rounds of scaff10x; scaffolding p1.fasta
                  │   ├── {genome_id}_s2.fasta.gz                  Bionano TGH; hybrid scaffold of 2 enzymes over s1.fasta
                  │   ├── {genome_id}_s3.fasta.gz                  Salsa scaffolding with Arima hiC libraries over s2.fasta
                  │   ├── {genome_id}_t1.fasta.gz                  Arrow polishing over s3 + q2
                  │   ├── {genome_id}_t2.fasta.gz                  1 round of longranger_freebayes polishing over t1.fasta
                  │   └── {genome_id}_t3.fasta.gz                  2nd round of longranger_freebayes polishing over t2.fasta
                  ├── {genome_id}.pri.asm.YYYYMMDD.fasta.gz        Final assembly (primary)
                  └── {genome_id}.alt.asm.YYYYMMDD.fasta.gz        Final assembly (alternate haplotigs)

Detailed intermediate assembly names and rules for v2

intermediate_name	full_verbal	description
c1	primary	hifiasm primary contigs
c2	alternate	hifiasm alternate contigs
hap1	haplotype 1	hifiasm haplotype 1 contigs, generated from hifiasm with HiC-phasing
hap2	haplotype 2	hifiasm haplotype 2 contigs, generated from hifiasm with HiC-phasing
p1	purged primary contigs	purged primary contigs
p2	purged haplotigs	haplotigs removed from primary contigs during purging
q2	purged alternate contigs	p2 concatenated to c2, and then undergone purging
s1	bionano scaffolds	hybrid scaffolds and un-scaffolded contigs from bionano
s2	Hi-C scaffolds	Hi-C scaffolds, which can be generated from contigs or from bionano scaffolds

Mitochondrial Assemblies

  /
  └── species/
      └── {Genus_species}/
          └── {ToLID}/
              └── assembly_MT_{institution}/
                └── {genome_id}.MT.YYYYMMDD.fasta.gz

It is acceptable to name the directory assembly_MT{suffix}, where the suffix is an underscore followed by some informative string. This can be useful when more than 1 mitochondrial assemblies were generated so that they can be easily differentiated from each other. Examples of informative strings are the location or institution that generated the assembly, a version, or a date.

Old File Structure

The current structure is based on this specification, with changes reflecting subsequent discussions and new data types. The following is how the file tree looked before:

/
└── species/
    └── {Genus_species}/
        └── {ToLID}/
            ├── assembly_{pipeline}_{ver}/
            ├── assembly_curated/
            ├── assembly_MT/
            ├── genomic_data/
            │   ├── ont/
            │   │   ├── reads1.bam
            │   │   ├── reads1.fastq.gz
            │   │   ├── reads2.bam
            │   │   └── reads2.fastq.gz
            │   └── pacbio/
            │       ├── reads1.fastq.gz
            │       ├── reads2.fastq.gz
            │       └── reads3.fastq.gz
            └── transcriptomic_data/
                └── {tissue_id}/
                    ├── illumina/
                    │   ├── reads1_1.fastq.gz
                    │   ├── reads1_2.fastq.gz
                    │   ├── reads2_1.fastq.gz
                    │   └── reads2_2.fastq.gz
                    └── pacbio/
                        ├── reads1.fastq.gz
                        ├── reads2.fastq.gz
                        └── reads3.fastq.gz

The primary changes to this structure are the addition of new directories under genomic_data. Generally, each data type is named after the company that generated it. This has changed slightly since multiple companies are generating multiple types of data. Try not to let the inconsistency get to you. If you have a data type not specified, please reach out to #data-coord on the VGP Slack workspace for a discussion on naming.