Importing Custom Datasets

You can import your own datasets into CRUX using the ‘Import Data’ module.

Where do I start?

I want to prepare mutation data for import into CRUX
I want to create clinical annotation files to enable survival analysis / virtual cohort creation / enrichment analysis (must already have mutation data prepared)

Step 1: Preparing your mutation data

There are several ways to import your own data into CRUX.

Use the table below to identify the most convenient method depending on your starting point.

Input Filetype	Required Formatting
MAF	Directly Supported: Import Straight to CRUX
ANNOVAR (TSV)	Directly Supported: Import Straight to CRUX
VCFs [2-sample, unannotated]	Convert to ANNOVAR or Convert to MAF
VCFs [2-sample, vep-annotated]	Convert to MAF
SOLID GFF3	Convert to ANNOVAR
Complete Genomics (TSV)	Convert to ANNOVAR
Complete Genomics (masterVar)	Convert to ANNOVAR

Option 1: Import MAF file

The MAF file format has become a popular tabular format for storing mutation data. It is the format used by the Genomic Data Commons, which houses public TCGA and PCAWG data.

CRUX supports direct import of MAF files.

MAF files can be quite large, but CRUX requires only a small subset of the possible columns:

Column Name	Description	Valid Values
Tumor_Sample_Barcode	Sample Identifier	Alphanumeric
Hugo_Symbol	Gene Name (from HUGO consortium)	Alphanumeric
Chromosome	Chromosome	Alphanumeric
Start_Position	Mutation Start (1-based)	Numeric
End_Position	Mutation End (1-based)	Numeric
Reference_Allele	Reference Allele	A,C,T,G (variable length)
Tumor_Seq_Allele2	Alt Allele (present in tumour)	A,C,T,G (variable length)
Variant_Classification	Translational effect of variant allele	See the ‘(MAF) Valid Variant Classifications’ section under Data Dictionaries
Variant_Type	Type of mutation	SNP,DNP,TNP,ONP,INS,DEL,Consolidated

If you have a MAF file, you can Import it directly into CRUX.

Example MAF file: APL_primary_and_relapse.maf
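
Before importing, it can help to confirm that a MAF file contains every required column. The following is a minimal sketch in Python, not part of CRUX: the function name `missing_maf_columns` and the demo data are hypothetical, and it assumes a tab-separated file whose header may be preceded by ‘#’ comment lines.

```python
import io

# Required columns, copied from the table above.
REQUIRED_MAF_COLUMNS = [
    "Tumor_Sample_Barcode", "Hugo_Symbol", "Chromosome",
    "Start_Position", "End_Position", "Reference_Allele",
    "Tumor_Seq_Allele2", "Variant_Classification", "Variant_Type",
]

def missing_maf_columns(handle):
    """Return the required columns that are absent from a MAF header."""
    for line in handle:
        # Skip comment lines (e.g. '#version 2.4') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        header = line.rstrip("\r\n").split("\t")
        return [c for c in REQUIRED_MAF_COLUMNS if c not in header]
    # No header line found: every required column is missing.
    return list(REQUIRED_MAF_COLUMNS)

# Hypothetical MAF content that lacks the Variant_Type column.
demo = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tEnd_Position\t"
    "Variant_Classification\tReference_Allele\tTumor_Seq_Allele2\t"
    "Tumor_Sample_Barcode\n"
    "TP53\t17\t7577120\t7577120\tMissense_Mutation\tC\tT\tsample1\n"
)
print(missing_maf_columns(demo))  # ['Variant_Type']
```

For a real file, pass `open("your_file.maf")` instead of the in-memory demo. An empty list means all required columns are present.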

Option 2: Create and Import ANNOVAR annotation files

ANNOVAR is a widely used tool for annotating the impact of genomic variants. It is a standard part of many bioinformatics pipelines.

ANNOVAR files are tabular and include various annotation columns, but only a small subset is required for CRUX to read them correctly.

Required Columns:

Column Name	Description	Valid Values
Chr	Chromosome	e.g., 1, 2, X, Y
Start	Start position	Integer
End	End position	Integer
Ref	Reference allele	Single base or indel
Alt	Alternate allele	Single base or indel
Func.refGene OR Func.ensGene	Functional annotation	String
Gene.refGene OR Gene.ensGene	Gene symbol	String
GeneDetail.refGene OR GeneDetail.ensGene	Gene details	String
ExonicFunc.refGene OR ExonicFunc.ensGene	Exonic function annotation	String
AAChange.refGene OR AAChange.ensGene	Amino acid change annotation	String

If you already have an ANNOVAR annotation file, you can Import it directly into CRUX.

Example Annovar file: demo_annovar.txt
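
The ‘OR’ rows in the required-columns table can be checked programmatically. The sketch below is an illustration only (the function name `missing_annovar_columns` is hypothetical): it reads the table literally, accepting either the refGene or the ensGene variant of each gene-based column. It does not check whether CRUX requires all five to come from the same gene model.

```python
# Columns that must be present under exactly these names.
REQUIRED_FIXED = ["Chr", "Start", "End", "Ref", "Alt"]
# Gene-based columns, each accepted with either suffix.
GENE_FIELDS = ["Func", "Gene", "GeneDetail", "ExonicFunc", "AAChange"]
GENE_MODELS = ["refGene", "ensGene"]

def missing_annovar_columns(header):
    """Return a description of each required column missing from `header`."""
    missing = [c for c in REQUIRED_FIXED if c not in header]
    for field in GENE_FIELDS:
        if not any(f"{field}.{model}" in header for model in GENE_MODELS):
            missing.append(f"{field}.refGene OR {field}.ensGene")
    return missing

# Hypothetical header line that lacks the amino acid change column.
header_line = (
    "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\t"
    "GeneDetail.refGene\tExonicFunc.refGene"
)
print(missing_annovar_columns(header_line.split("\t")))
# ['AAChange.refGene OR AAChange.ensGene']
```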

Q: How do I get an ANNOVAR annotated file?

To obtain an ANNOVAR annotated file, you can either request your bioinformatics team to run it for you, or manually perform the annotation without programming using the process documented below.

Creating ANNOVAR files (Using only Graphical Interfaces)

Supported Starting Filetypes
VCFs (Single Sample)
VCFs (2-sample, tumor-normal) [1]
SOLID GFF3
Complete Genomics TSV
Complete Genomics masterVar

Warning

For large cohorts (>10 samples), manually running ANNOVAR on each single/two-sample VCF is repetitive and time-consuming.

Modern analysis pipelines typically output either ANNOVAR files, which can be imported directly into CRUX, or VEP-annotated VCFs, which can be converted to MAFs all at once using the Interchange web app.

Please consider asking whoever runs your analysis pipelines if either ANNOVAR or VEP-annotated files are available.

Visit wAnnovar
Input your files (and select the matched Input Format from the dropdown) Example VCF
Configure Parameters
1. Choose an appropriate reference genome.
2. Select the relevant input format (e.g. VCF if you’ve uploaded a vcf file)
3. Leave the remaining settings as default (see screenshot below for expected values)
Download annovar (TXT) file (genome summary results). Clicking the link will open the annotation file in a new tab. Hit ctrl/command + S to download this file.
Repeat for each single sample VCF (or other input files) in your cohort
Import annovar files into CRUX

Creating ANNOVAR files (for bioinformaticians)

We recommend the following settings when running ANNOVAR annotation from the command line:

table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,dbnsfp30a -operation g,r,f -nastring NA

Note

CRUX will attempt to auto-detect as much as possible about the features of your ANNOVAR annotation. It requires that ANNOVAR was run with gene-based annotation as the first operation, before any filter-based or region-based annotations. Please be aware that the CRUX ANNOVAR parser performs no transcript prioritization.
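
The ordering requirement in the note above can be checked before launching a long annotation run. This is a minimal sketch and not part of CRUX or ANNOVAR: the helper name `gene_annotation_is_first` is hypothetical, and it assumes the comma-separated `-protocol` and `-operation` values are passed in as strings, with `g` denoting a gene-based operation as in the command above.

```python
def gene_annotation_is_first(protocol, operation):
    """Return True if the first listed ANNOVAR operation is gene-based ('g')."""
    protocols = protocol.split(",")
    operations = operation.split(",")
    # Each protocol entry needs a matching operation entry.
    if len(protocols) != len(operations):
        raise ValueError(
            "-protocol and -operation must list the same number of entries"
        )
    return operations[0] == "g"

print(gene_annotation_is_first("refGene,cytoBand,dbnsfp30a", "g,r,f"))  # True
print(gene_annotation_is_first("cytoBand,refGene", "r,g"))              # False
```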

Option 3: Convert VCFs To MAF using Interchange

Note

To maximise accessibility, this section describes how to convert VCFs to MAF files using web apps only (no coding).

If you are comfortable working on the command line, we recommend trying vcf2maf.

Interchange is the easiest way to convert VEP-annotated VCFs into cohort MAF files compatible with CRUX.

If you have unannotated VCFs, please first annotate them with VEP as described in the ‘Annotating Variants with VEP (Graphical tools only)’ section.

Once you have VEP-annotated VCFs, head to the Interchange Web App and select VCF to MAF conversion.

Then select all your VCF files as pictured below.

Fill in the metadata about your cohort in the ‘Step 2’ panel.

You may need to alter the expected IDs of the tumour and normal samples to match your VCFs. Most somatic variant callers used in tumor-normal pipelines produce 2-sample VCFs with the tumour sample named ‘TUMOR’ and the normal sample named ‘NORMAL’, which is what the Interchange vcf2maf converter expects. If your VCFs use different names (you can open a VCF in a text editor to check), change the expected IDs accordingly. If the tumour sample name differs from one VCF to another, please check ‘Assume IDs in VCF match Tumor Sample Barcodes’.

Example of opening up a VCF to check how tumor and normal samples are named
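
As an alternative to a text editor, the sample names can be read from the VCF header line. The sketch below is illustrative (the function name `vcf_sample_names` is hypothetical) and relies only on the VCF convention that the `#CHROM` header line lists nine fixed columns followed by one column per sample.

```python
import io

def vcf_sample_names(handle):
    """Return the sample names listed on the #CHROM header line of a VCF."""
    for line in handle:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\r\n").split("\t")
            # Fields 0-8 are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT.
            return fields[9:]
    raise ValueError("no #CHROM header line found")

# Hypothetical header of a 2-sample tumor-normal VCF.
demo = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR\n"
)
print(vcf_sample_names(demo))  # ['NORMAL', 'TUMOR']
```

For a real file, pass `open("sample.vcf")`; for a compressed file, `gzip.open("sample.vcf.gz", "rt")` from the standard library works the same way.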

Check the ‘VCF file -> Tumour Name Mappings’ table to confirm that Interchange has correctly guessed the appropriate sample name for each file. You can manually change these sample names if required.

Finally, click convert to download your MAF file.

Step 2: Prepare Clinical Annotation Files

In addition to loading your mutation data, CRUX supports optional import of clinical annotations. If you have any sample-level data, e.g. disease subtype, patient gender, or age, we recommend importing it so that it can be added to visualisations, used to define virtual cohorts, and used to study the relationships between clinical annotations and mutational profile.

The clinical annotations file must be a tsv/csv with a header row. It must contain a ‘Tumor_Sample_Barcode’ column containing sample IDs that match the Tumor_Sample_Barcode column of your mutation file.

You can then add as many additional columns as you like, where each column represents a variable.

For example:

Tumor_Sample_Barcode	Disease_Subtype	Gender
sample1	Subtype1	Female
sample2	Subtype1	Male
sample3	Subtype2	Male
sample5	Subtype2	Male

The file you’d actually import would be example.csv
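
Because the sample IDs in the clinical annotation file must match those in the mutation file, a quick comparison can catch typos before import. This is a minimal sketch under stated assumptions: the function names and demo data are hypothetical, the mutation file is a tab-separated MAF (optionally with ‘#’ comment lines), and the clinical file is a CSV.

```python
import csv
import io

def sample_ids(handle, delimiter):
    """Collect the Tumor_Sample_Barcode values from a delimited file."""
    lines = (line for line in handle if not line.startswith("#"))
    return {row["Tumor_Sample_Barcode"]
            for row in csv.DictReader(lines, delimiter=delimiter)}

def compare_sample_ids(maf_handle, clinical_handle, clinical_delimiter=","):
    """Return (IDs only in the clinical file, IDs only in the MAF), sorted."""
    maf_ids = sample_ids(maf_handle, "\t")
    clinical_ids = sample_ids(clinical_handle, clinical_delimiter)
    return sorted(clinical_ids - maf_ids), sorted(maf_ids - clinical_ids)

# Hypothetical inputs: sample4 has annotations but no mutations,
# and sample3 has mutations but no annotations.
maf = io.StringIO(
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "TP53\tsample1\nKRAS\tsample2\nNRAS\tsample3\n"
)
clinical = io.StringIO(
    "Tumor_Sample_Barcode,Disease_Subtype,Gender\n"
    "sample1,Subtype1,Female\nsample2,Subtype1,Male\nsample4,Subtype2,Male\n"
)
print(compare_sample_ids(maf, clinical))  # (['sample4'], ['sample3'])
```

A sample that appears only in the MAF is not necessarily an error, since clinical annotations are optional. An ID that appears only in the clinical file usually points to a naming mismatch.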

Survival Analysis

To identify genetic biomarkers of good / poor survival you need to include survival data in your clinical annotation file. Two columns are required:

days_to_last_followup
vital_status (1=dead; 0=alive)
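
The two survival columns can be sanity-checked before import. The sketch below is an illustration, not CRUX's own validation: the function name `survival_problems` is hypothetical, it assumes a CSV clinical file, and it flags blank or non-numeric values for manual review without assuming how CRUX itself treats them.

```python
import csv
import io

def survival_problems(handle, delimiter=","):
    """Return a list of human-readable problems found in the survival columns."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    columns = reader.fieldnames or []
    problems = [f"missing column: {name}"
                for name in ("days_to_last_followup", "vital_status")
                if name not in columns]
    if problems:
        return problems
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        days = (row["days_to_last_followup"] or "").strip()
        status = (row["vital_status"] or "").strip()
        try:
            if float(days) < 0:
                problems.append(
                    f"line {line_number}: days_to_last_followup is negative: {days!r}")
        except ValueError:
            problems.append(
                f"line {line_number}: days_to_last_followup is not a number: {days!r}")
        if status not in {"0", "1"}:
            problems.append(
                f"line {line_number}: vital_status must be 0 or 1, got {status!r}")
    return problems

# Hypothetical clinical file with two bad rows.
demo = io.StringIO(
    "Tumor_Sample_Barcode,days_to_last_followup,vital_status\n"
    "sample1,420,0\n"
    "sample2,unknown,1\n"
    "sample3,88,dead\n"
)
for problem in survival_problems(demo):
    print(problem)
```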

Step 3: Importing your dataset into CRUX

If you want to look at your own data in CRUX, prepare your file in MAF/ANNOVAR format as described above, then import it using the ‘Import Data’ module.

Optionally import any sample-level metadata (an example file can be downloaded and opened using Excel). Please see the Prepare Clinical Annotation Files section for details.

Choose a name and description for your dataset (all fields must be filled in to continue)

Add the dataset to our data pool

You should now be able to select your dataset for use in any of the analysis/visualisation modules

Annotating Variants with VEP (Graphical tools only)

Navigate to VEP and create a new job
Ensure the chosen ‘Assembly’ is appropriate. If your variants are called based on hg38/GRCh38 reference genomes the link above is appropriate. If your pipelines use hg19/GRCh37 reference genomes you’ll need to use the GRCh37 version
Upload your VCF
Configure VEP with the following settings
1. Transcript Database to Use: Ensembl/GENCODE transcripts.
  
  Note
  
  You can use other transcript databases so long as you ensure consistency between the VCFs in your cohort (and any other cohort you want to compare results to)

2. Identifiers: Check Gene Symbol & Transcript Version & HGVS
3. Additional Annotations > Transcript Annotation: Check Transcript Biotype & Identify Canonical Transcripts
4. Variants and Frequency Data: Check gnomAD (exomes) allele frequencies

Run VEP and download results as VCF
Repeat for each VCF in your cohort

Data Dictionaries

A collection of data dictionaries for various filetypes

(MAF) Valid Variant Classifications

Frame_Shift_Del

Frame_Shift_Ins

In_Frame_Del

In_Frame_Ins

Missense_Mutation

Nonsense_Mutation

Silent

Splice_Site

Translation_Start_Site

Nonstop_Mutation