This is a DRAFT. Please send comments to #data-coord on Slack.

Easy, Step-by-Step Data Upload Guide

This guide follows the general recipe described in these data ingress instructions.

Step 0 – Pre-Upload Tasks

Before your data can go on Genomeark, you will need to obtain a Tree of Life Identifier (ToLID), as described in the data ingress instructions. You can still upload your data to the test bucket without one, but we will not be able to move it to the primary bucket and display it on the website until a ToLID is assigned. Please start this process before, or at the same time as, the data upload.

You will also need information for various READMEs (described here) and the metadata.yaml file (described here).

Step 1 – Data Organization

This step demonstrates one way to organize your data; you will need to adapt it to your own data and circumstances. The objective is to create a directory whose contents can eventually be placed on Genomeark without modification in the following location: /species/{Genus}_{species}/{ToLID}/. For this example, we assume you are uploading PacBio HiFi data, ONT simplex data, Arima Hi-C data, and PacBio Iso-Seq (RNA) data. We also assume that you have not yet generated any assemblies, that all data is from the same individual, and that the RNA data was generated from brain tissue. Not sure how to organize the data for your specific project? Please review the documentation on bucket structure.

Step 1.1 – Create Directories

mkdir genomeark-upload
cd genomeark-upload
mkdir -p genomic_data/{pacbio_hifi,ont,arima} transcriptomic_data/brain/pacbio

Step 1.2 – Localize Files

Copy (cp/rsync), move (mv), or link (ln -s) all data files into the directories created above. This example uses symbolic links, on the assumption that the data is already on your system in an organization that works for you.

# beginning inside the genomeark-upload directory

####################
# PacBio HiFi data #
####################
cd genomic_data/pacbio_hifi
ln -s /path/to/movie-1-name.subreads.bam ./
ln -s /path/to/movie-1-name.subreads.bam.pbi ./
ln -s /path/to/movie-1-name.hifi_reads.bam ./
ln -s /path/to/movie-1-name.hifi_reads.fastq.gz ./
ln -s /path/to/movie-1-name.deepconsensus-1.2.0.fastq.gz ./
# repeat for any other cells
cd ..

############
# ONT data #
############
cd ont
ln -s /path/to/flowcell-1-name.guppy-6.4.6.bam ./
ln -s /path/to/flowcell-1-name.guppy-6.4.6.fastq.gz ./
mkdir -p fast5/flowcell-1-name
cd fast5/flowcell-1-name
for FAST5 in /path/to/fast5s/flowcell-1-name_*.fast5
do
	ln -s "${FAST5}" ./
done
cd ../..
# repeat for any other flowcells
cd ..

#############
# Hi-C data # (from Arima Genomics in this example)
#############
cd arima
ln -s /path/to/library-and-lane-1-name_1.fastq.gz ./
ln -s /path/to/library-and-lane-1-name_2.fastq.gz ./
# repeat for any other lanes and/or libraries
cd ..

#######################
# PacBio Iso-Seq data #
#######################
cd ../transcriptomic_data/brain/pacbio
ln -s /path/to/movie-1-name.subreads.bam ./
ln -s /path/to/movie-1-name.subreads.bam.pbi ./
ln -s /path/to/movie-1-name.subreadset.xml ./
ln -s /path/to/isoseq-output-name.bam ./
ln -s /path/to/isoseq-output-name.fastq.gz ./
cd ../../..
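
Because this example relies on symbolic links, it may be worth confirming that every link resolves before moving on. The following is a minimal sketch, not part of the official procedure; the function name find_broken_links is hypothetical. It prints any link under the current directory whose target does not exist and prints nothing when all links are valid.

```shell
# List symbolic links below the current directory whose targets do not exist.
# "test -e" follows the link, so it fails only when the target is missing.
find_broken_links() {
    find . -type l ! -exec test -e {} \; -print
}

# run from inside the genomeark-upload directory
find_broken_links
```

Any path it prints usually points to a typo in the source path given to ln -s.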

Step 1.3 – Provide Checksums

Provide MD5 checksums for all data files. If you already have them, copy them locally, renaming the file names recorded inside the checksum file if the local filenames do not match the source names. Please provide one checksum file per data type directory. Sometimes it is reasonable to provide separate files for subdirectories, e.g., one for each flowcell’s worth of ONT signal data (usually in FAST5 format, chunked into many files). In this example, that means four files, one per data type, plus one more for each ONT flowcell if FAST5s are provided. Use a reasonable filename for the checksum file, e.g., files.md5. This example assumes you need to generate all the MD5 sums; copying existing ones is faster if you have them.
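
If you are reusing existing checksums, the file names recorded inside the checksum file often carry the source directory. The sketch below shows one way to strip those leading directories. It assumes the usual md5sum line format (32 hexadecimal digits, a space, a space or asterisk, then the path); the function name strip_md5_paths and the source path in the usage comment are hypothetical.

```shell
# Strip leading directories from the file names recorded in an md5sum-style file.
# Assumed line format: <32 hex digits><space><space or *><path>
# Lines whose path has no directory component are passed through unchanged.
strip_md5_paths() {
    sed 's#^\([0-9a-f]\{32\} [ *]\).*/#\1#' "$1"
}

# Hypothetical usage, run inside a data type directory:
# strip_md5_paths /path/to/source-checksums.md5 > files.md5
```

The commands that follow cover the other case, generating the checksums from scratch.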

# beginning inside the genomeark-upload directory

####################
# PacBio HiFi data #
####################
cd genomic_data/pacbio_hifi
md5sum *.bam *.pbi *.fastq.gz > files.md5
cd ..

############
# ONT data #
############
cd ont
md5sum *.bam *.fastq.gz > files.md5
cd fast5
cd flowcell-1-name
md5sum *.fast5 > files.md5
cd ..
# repeat for any other flowcells
cd ../..

#############
# Hi-C data # (from Arima Genomics in this example)
#############
cd arima
md5sum *.fastq.gz > files.md5
cd ..

#######################
# PacBio Iso-Seq data #
#######################
cd ../transcriptomic_data/brain/pacbio
md5sum *.bam *.pbi *.xml *.fastq.gz > files.md5
cd ../../..
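
Whether the checksums were generated or copied, it can help to check them against the local files before uploading. This is a sketch under the assumption that every checksum file is named files.md5 and lists file names relative to its own directory, as in the example above; the function name verify_checksums is hypothetical.

```shell
# Check every files.md5 below the current directory against the files beside it.
# Returns non-zero if any checksum file reports a mismatch or a missing file.
verify_checksums() {
    find . -name files.md5 -exec sh -c '
        rc=0
        for f in "$@"; do
            # md5sum -c resolves names relative to the working directory,
            # so change into the directory that holds the checksum file.
            ( cd "$(dirname "$f")" && md5sum -c files.md5 ) || rc=1
        done
        exit "$rc"
    ' sh {} +
}

# run from inside the genomeark-upload directory
verify_checksums
```

md5sum follows symbolic links, so this reads the linked data files themselves.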

Step 1.4 – Write READMEs

Provide relevant details in README files, one (or more as needed) in each data type directory.

# beginning inside the genomeark-upload directory

for DATATYPE_DIR in genomic_data/{pacbio_hifi,ont,arima} transcriptomic_data/brain/pacbio
do
	vim ${DATATYPE_DIR}/README
done
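
Once the READMEs are written, a quick non-interactive check can confirm that none was left out or saved empty. This is a minimal sketch for the four data type directories used in this example; the function name check_readmes is hypothetical, and the directory list should be adjusted to your own layout.

```shell
# Report each data type directory whose README is missing or empty.
# Prints nothing when every directory has a non-empty README.
check_readmes() {
    for dir in \
        genomic_data/pacbio_hifi \
        genomic_data/ont \
        genomic_data/arima \
        transcriptomic_data/brain/pacbio
    do
        [ -s "$dir/README" ] || echo "README missing or empty: $dir"
    done
}

# run from inside the genomeark-upload directory
check_readmes
```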

Step 2 – UUID & Descriptive Name

Step 2.1 – UUID

Generate a UUID. Online tools can do this for you, but it is simple to accomplish on the command line, e.g., with uuidgen from the util-linux package. macOS provides its own binary for this purpose. Run the following command and record the result.

uuidgen

Future steps in this guide will assume the UUID you generated is available in the variable ${UUID}.
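
One way to make the UUID available in that variable is to capture the output of uuidgen directly. The sketch below does so, with a fallback to the Linux kernel's /proc interface when uuidgen is not installed; the fallback is an assumption that applies only to Linux systems.

```shell
# Generate a UUID and keep it in the UUID variable for the later steps.
if command -v uuidgen >/dev/null 2>&1; then
    UUID=$(uuidgen)
else
    # Assumption: Linux-only fallback, used when uuidgen is unavailable.
    UUID=$(cat /proc/sys/kernel/random/uuid)
fi
echo "UUID=${UUID}"
```

The variable lasts only for the current shell session, so still record the printed value somewhere and set UUID again by hand if you return to the upload later.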

Step 2.2 – Descriptive Name

Choose a short, descriptive name and record it. Consider using your institution’s abbreviation, ToLID, species name (common or some abbreviation of the scientific name), specimen/sample name, data type(s), date, tissue(s), and/or assembly name or method when crafting the descriptive name. Do not include any personally identifiable information. Use only alphanumeric ASCII characters, dashes, periods, and underscores, which means no whitespace is permitted. Here are a few examples: “fCarIgn1_seq_data”, “stonefly_HiFi-ONT”, and “JHU-Maize-Verkko-asm”. Future steps in this guide will assume your chosen name is available in the variable ${NAME}.
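
The character rule above is easy to check mechanically before the name ends up in a bucket path. The following sketch sets NAME to one of the example names from this guide and validates it; the function name is_valid_name is hypothetical.

```shell
# Succeed only if the name is non-empty and contains nothing but ASCII
# letters, digits, dashes, periods, and underscores.
# The function body runs in a subshell so the locale change stays local.
is_valid_name() (
    LC_ALL=C
    case "$1" in
        ''|*[!A-Za-z0-9._-]*) exit 1 ;;
    esac
    exit 0
)

NAME="fCarIgn1_seq_data"   # one of the example names from this guide
if is_valid_name "$NAME"; then
    echo "name ok: ${NAME}"
else
    echo "invalid name: ${NAME}" >&2
fi
```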

Step 3 – Upload the Data

Step 3.1 – Share metadata.yaml

You can share the metadata.yaml file you generated via Slack or email. If you are comfortable with git, you can instead make a pull request with the new information in the genomeark/genomeark-metadata GitHub repository.

Step 3.2 – Sync the Local Directory to the S3 Bucket

# beginning inside the genomeark-upload directory

aws --profile=genomeark s3 sync ./ s3://genomeark-upload/incoming/${UUID}--${NAME}/
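
Before starting a large transfer, you may want to preview it. The sketch below builds the same destination prefix as the command above and runs the sync with the AWS CLI's --dryrun option, which lists the planned operations without transferring anything. The helper name upload_dest is hypothetical, and the preview is skipped unless UUID and NAME are set and the aws command is installed.

```shell
# Build the destination prefix used by the sync command in this step.
upload_dest() {
    printf 's3://genomeark-upload/incoming/%s--%s/\n' "$1" "$2"
}

# run from inside the genomeark-upload directory
if [ -n "${UUID:-}" ] && [ -n "${NAME:-}" ] && command -v aws >/dev/null 2>&1; then
    # --dryrun prints what would be uploaded without uploading it.
    aws --profile=genomeark s3 sync --dryrun ./ "$(upload_dest "$UUID" "$NAME")"
else
    echo "Set UUID and NAME and install the AWS CLI before previewing the upload." >&2
fi
```

If the listed destination and files look right, run the same command without --dryrun to perform the upload.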

Step 4 – Transfer to Primary Bucket

When you’re ready, notify us. We will copy your data to the primary genomeark bucket and then remove your copy from the temporary space in the genomeark-upload bucket.