This is a DRAFT. Please send comments to #data-coord
on Slack.
Easy, Step-by-Step Data Upload Guide
This guide follows the general recipe described in these data ingress instructions.
Step 0 – Pre-Upload Tasks
Before your data can go on Genomeark, you will need to obtain a Tree of Life Identifier (ToLID), as described in the data ingress instructions. You can still upload your data to the test bucket without one, but we will not be able to move it to the primary bucket and display it on the website until a ToLID is assigned. Please start the process before or concurrently with data upload.
You will also need information for various READMEs (described here) and the metadata.yaml file (described here).
Step 1 – Data Organization
This step demonstrates organizing your data, and you must extrapolate to your
specific data and circumstances. The objective here is to create a directory
with contents that can eventually be placed on Genomeark without modification in
the following location: /species/{Genus}_{species}/{ToLID}/. For this example,
we will assume you are uploading PacBio HiFi data, ONT simplex data, Arima Hi-C
data, and PacBio Iso-seq (RNA) data. We assume you have not yet generated any
assemblies. We assume all data is from the same individual, and that the RNA
data is generated from brain tissue. Not sure how to organize the data for your
specific project? Please review the documentation on
bucket structure.
Step 1.1 – Create Directories
mkdir genomeark-upload
cd genomeark-upload
mkdir -p genomic_data/{pacbio_hifi,ont,arima} transcriptomic_data/brain/pacbio
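The brace expansion in the mkdir command above is a bash/zsh feature; a strictly POSIX sh takes the braces literally and creates oddly named directories. A quick check along the following lines may help confirm the layout before linking files. This is only a sketch: the function name check_layout is hypothetical, and the directory list simply mirrors this guide's example.

```shell
# Sketch: report any expected data type directory that is missing.
# The directory list matches this guide's example; adjust it for your project.
check_layout() {
    root=${1:-.}
    missing=0
    for dir in \
        genomic_data/pacbio_hifi \
        genomic_data/ont \
        genomic_data/arima \
        transcriptomic_data/brain/pacbio
    do
        if [ ! -d "${root}/${dir}" ]; then
            echo "missing: ${dir}"
            missing=1
        fi
    done
    # Non-zero exit status when at least one directory is absent
    return "${missing}"
}
```

Run it as check_layout . from inside the genomeark-upload directory; it prints nothing when every directory is present.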
Step 1.2 – Localize Files
Copy (cp/rsync), move (mv), or link (ln -s) all data files into place. We will
demonstrate linking, under the assumption that the data is already on your
system in an organization that works for you.
# beginning inside the genomeark-upload directory
####################
# PacBio HiFi data #
####################
cd genomic_data/pacbio_hifi
ln -s /path/to/movie-1-name.subreads.bam ./
ln -s /path/to/movie-1-name.subreads.bam.pbi ./
ln -s /path/to/movie-1-name.hifi_reads.bam ./
ln -s /path/to/movie-1-name.hifi_reads.fastq.gz ./
ln -s /path/to/movie-1-name.deepconsensus-1.2.0.fastq.gz ./
# repeat for any other cells
cd ..
############
# ONT data #
############
cd ont
ln -s /path/to/flowcell-1-name.guppy-6.4.6.bam ./
ln -s /path/to/flowcell-1-name.guppy-6.4.6.fastq.gz ./
mkdir -p fast5/flowcell-1-name
cd fast5/flowcell-1-name
for FAST5 in /path/to/fast5s/flowcell-1-name_*.fast5
do
    ln -s "${FAST5}" ./
done
cd ../..
# repeat for any other flowcells
cd ..
#############
# Hi-C data # (from Arima Genomics in this example)
#############
cd arima
ln -s /path/to/library-and-lane-1-name_1.fastq.gz ./
ln -s /path/to/library-and-lane-1-name_2.fastq.gz ./
# repeat for any other lanes and/or libraries
cd ..
#######################
# PacBio Iso-Seq data #
#######################
cd ../transcriptomic_data/brain/pacbio
ln -s /path/to/movie-1-name.subreads.bam ./
ln -s /path/to/movie-1-name.subreads.bam.pbi ./
ln -s /path/to/movie-1-name.subreadset.xml ./
ln -s /path/to/isoseq-output-name.bam ./
ln -s /path/to/isoseq-output-name.fastq.gz ./
cd ../../..
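Because this example relies on symbolic links, a mistyped source path produces a dangling link that only surfaces later, at checksum or upload time. The following sketch lists such links; the function name find_broken_links is hypothetical and not part of any Genomeark tooling.

```shell
# Sketch: list symbolic links under a directory whose targets do not exist.
find_broken_links() {
    find "${1:-.}" -type l | while IFS= read -r link
    do
        # -e follows the link, so the test fails when the target is missing
        [ -e "${link}" ] || echo "${link}"
    done
}
```

Running find_broken_links . from inside the genomeark-upload directory should print nothing if every link resolves.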
Step 1.3 – Provide Checksums
Provide MD5 checksums for all data files. If you already have them, copy them
locally, renaming the file names recorded inside the checksum file if the local
filenames do not match the source names. Please provide one checksum file per
data type directory. Sometimes it is reasonable to provide separate files for
subdirectories, e.g., for each flowcell’s worth of ONT signal data
(usually in FAST5 format, chunked into many files). In this example, that would
be 4 files, one per data type, plus one more file for each ONT flowcell if
FAST5s are provided. Use a reasonable filename for the checksum file, e.g.,
files.md5. This example assumes you need to generate all the MD5 sums, but
copying existing ones is faster if you have them.
# beginning inside the genomeark-upload directory
####################
# PacBio HiFi data #
####################
cd genomic_data/pacbio_hifi
md5sum *.bam *.pbi *.fastq.gz > files.md5
cd ..
############
# ONT data #
############
cd ont
md5sum *.bam *.fastq.gz > files.md5
cd fast5
cd flowcell-1-name
md5sum *.fast5 > files.md5
cd ..
# repeat for any other flowcells
cd ../..
#############
# Hi-C data # (from Arima Genomics in this example)
#############
cd arima
md5sum *.fastq.gz > files.md5
cd ..
#######################
# PacBio Iso-Seq data #
#######################
cd ../transcriptomic_data/brain/pacbio
md5sum *.bam *.pbi *.xml *.fastq.gz > files.md5
cd ../../..
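Whether the checksum files were generated or copied, it can be worth verifying them against the local files before uploading. The sketch below assumes GNU md5sum (the --quiet option suppresses the per-file OK lines) and paths without whitespace; the function name verify_checksums is hypothetical.

```shell
# Sketch: run "md5sum -c" in every directory that contains a files.md5.
# Assumes GNU md5sum and paths without whitespace.
verify_checksums() {
    status=0
    for sums in $(find "${1:-.}" -name files.md5)
    do
        # md5sum -c resolves file names relative to the current directory,
        # so run it from the directory holding the checksum file
        ( cd "$(dirname "${sums}")" && md5sum --quiet -c files.md5 ) || status=1
    done
    # Non-zero exit status if any checksum file failed verification
    return "${status}"
}
```

Run verify_checksums . from inside the genomeark-upload directory. Any file whose contents no longer match its recorded sum is reported as FAILED.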
Step 1.4 – Write READMEs
Provide relevant details in README files, one (or more as needed) in each
data type directory.
# beginning inside the genomeark-upload directory
for DATATYPE_DIR in genomic_data/{pacbio_hifi,ont,arima} transcriptomic_data/brain/pacbio
do
    vim "${DATATYPE_DIR}/README"
done
Step 2 – UUID & Descriptive Name
Step 2.1 – UUID
Generate a UUID. You can search the web to find an online tool to do this for
you. However, it is simple to accomplish on the command line, e.g., with
uuidgen from the util-linux package. macOS provides its own binary for this
purpose. Simply run the following command and record the result.
uuidgen
Future steps in this guide will assume the UUID you generated is available in
the variable ${UUID}.
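One way to capture the value directly in the variable, rather than copying it by hand, is sketched below. The helper is_uuid is a hypothetical name; it only checks the usual 8-4-4-4-12 hexadecimal layout and is useful mainly when the UUID was pasted in from elsewhere.

```shell
# Sketch: check that a value looks like a UUID (8-4-4-4-12 hex digits).
is_uuid() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}$'
}

# Generate and store the UUID only when uuidgen is installed
if command -v uuidgen >/dev/null 2>&1; then
    UUID=$(uuidgen)
    is_uuid "${UUID}" && echo "UUID=${UUID}"
fi
```

The variable lasts only for the current shell session, so still record the printed value somewhere durable.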
Step 2.2 – Descriptive Name
Choose a short, descriptive name and record it. Consider using your
institution’s abbreviation, ToLID, species name (common or some
abbreviation of the scientific name), specimen/sample name, data type(s), date,
tissue(s), and/or assembly name or method when crafting the descriptive name.
Do not include any personally-identifiable information. Use only alphanumeric
ASCII characters, dashes, periods, and underscores. By definition, that means
no whitespace is permitted. Here are a few examples:
“fCarIgn1_seq_data”, “stonefly_HiFi-ONT”, and
“JHU-Maize-Verkko-asm”. Future steps in this guide will assume your
chosen name is available in the variable ${NAME}.
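The character rules above can be checked mechanically. The following is a minimal sketch; the function name is_valid_name is hypothetical, and the check covers only the allowed character set, not whether the name is descriptive or free of personal information.

```shell
# Sketch: accept only ASCII letters, digits, dashes, periods, and underscores.
# The body runs in a subshell so the locale change does not leak out.
is_valid_name() (
    # LC_ALL=C keeps the A-Z and a-z ranges limited to ASCII in shells
    # that honor the locale when matching patterns
    LC_ALL=C
    case "$1" in
        ''|*[!A-Za-z0-9._-]*) return 1 ;;  # empty, or has a disallowed character
        *) return 0 ;;
    esac
)
```

For example, after setting NAME=stonefly_HiFi-ONT, the command is_valid_name "${NAME}" || echo "invalid name" prints nothing.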
Step 3 – Upload the Data
Step 3.1 – Share metadata.yaml
You can share the metadata.yaml file you generated via Slack or email. If you are comfortable with git, you can instead make a pull request with the new information in the genomeark/genomeark-metadata GitHub repository.
Step 3.2 – Sync the Local Directory to the S3 Bucket
# beginning inside the genomeark-upload directory
aws --profile=genomeark s3 sync ./ s3://genomeark-upload/incoming/${UUID}--${NAME}/
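If ${UUID} or ${NAME} is empty, the command above would target incoming/--/ instead of your own prefix. The sketch below guards against that and previews the transfer first. The function names upload_destination and preview_upload are hypothetical; --dryrun is a standard option of aws s3 sync that lists the planned operations without performing them.

```shell
# Sketch: build the destination prefix, refusing to continue if either
# variable is unset or empty.
upload_destination() {
    if [ -z "${UUID:-}" ] || [ -z "${NAME:-}" ]; then
        echo "UUID and NAME must both be set" >&2
        return 1
    fi
    printf 's3://genomeark-upload/incoming/%s--%s/\n' "${UUID}" "${NAME}"
}

# Preview the transfer from inside the genomeark-upload directory.
preview_upload() {
    dest=$(upload_destination) || return 1
    aws --profile=genomeark s3 sync --dryrun ./ "${dest}"
}
```

The AWS CLI follows symbolic links by default when uploading, so the linked files from Step 1.2 should be transferred as regular objects; the dry-run output is a good place to confirm this before running the real sync.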
Step 4 – Transfer to Primary Bucket
When you’re ready, notify us. We’ll copy your data to the primary genomeark
bucket, removing your copy from the temporary space in the genomeark-upload
bucket.