Data storage and compression using CRAM
Thu, 22 Nov 2018 21:43:00 +0000
As our database expands in scale, managing and storing large genomic sequence data has become a challenge, particularly with the large BAM files. As sequencing costs continue to fall, we are anticipating a tsunami of data as researchers switch from exome to whole genome sequencing. In preparation, we've taken the early step of further compressing our BAMs using lossless CRAM compression. Our database system manages its disk space autonomously: if our allocated disk usage reaches a threshold of 80%, it will automatically convert the oldest BAMs to CRAMs and archive them to tape storage, making way for newer datasets. Our testing has shown that the CRAM format saves roughly 30% of space and will provide significant cost savings. Users who wish to access an archived BAM can click a button in our web interface and the system will automatically restore the CRAM from tape and convert it back to a BAM.
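The archiving policy above can be sketched as a small selection function. This is a minimal illustration of the idea, not our actual storage daemon; the function name, the 70% recovery target and the tuple layout are all assumptions made for the example.

```python
ARCHIVE_THRESHOLD = 0.80  # start archiving once disk usage crosses 80%

def bams_to_archive(disk_used_fraction, bams, target_fraction=0.70):
    """Pick the oldest BAMs to convert to CRAM and move to tape until
    projected disk usage drops back to target_fraction.

    bams is a list of (name, age_days, size_fraction) tuples, where
    size_fraction is the file's share of total disk capacity.
    """
    if disk_used_fraction < ARCHIVE_THRESHOLD:
        return []  # below threshold, nothing to do
    selected = []
    projected = disk_used_fraction
    # Oldest first; archiving to tape frees the full BAM's footprint.
    for name, age_days, size_fraction in sorted(bams, key=lambda b: b[1], reverse=True):
        if projected <= target_fraction:
            break
        selected.append(name)
        projected -= size_fraction
    return selected
```

A real implementation would measure usage with something like `shutil.disk_usage` and shell out to `samtools` for the conversion; the point here is only the threshold-driven, oldest-first selection.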
Search filter: Clinical diagnosis and provisional variants
Fri, 12 Oct 2018 03:12:00 +0000
We've added 2 new search filters: Clinical diagnosis and Provisional variants. By combining these filters, we can find all patients that fall under the same disease category and look for provisional variants they may have in common. This is useful in a research context, where we may be able to find a pattern of variants that could be influential in disease pathogenesis. Both the Clinical diagnosis and Provisional variant information are pulled in from our related Patient Database. Therefore, for these filters to be useful, the information must first be specified in the Patient Database.
Predictive relatedness, sex and ancestry reports
Fri, 31 Aug 2018 04:23:00 +0000
We've recently added a new step in our pipeline that uses the genetic data to predict the degree of relatedness, sex and ancestry. This is particularly useful as a quality check to spot potential sample mix-ups, poor DNA quality, contamination, errors in the patient details provided, etc. In the event of a possible error, users are automatically notified by email with the reports attached for further investigation. We are currently running the reports retrospectively for all of our previous datasets and have already found some data entry errors. In such cases, we may want to rerun the pipeline analysis, as such errors can affect the variant prioritisation. These reports are also available as downloads in the 'datasets' section.
Prediction filtering can be separated by logical OR/AND
Fri, 31 Aug 2018 04:14:00 +0000
Previously, when combining filters on predictions and scores, the search joined each filter with a conditional AND by default. We've recently changed this to let users specify the logical operator (AND/OR) between the prediction and score filters, so you can ask the database for all variants that have polyphen prediction 'probably damaging' OR clinvar prediction 'pathogenic' in a single query. Previously, you would have had to run a separate search for each of polyphen and clinvar.
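Conceptually, the change is just which boolean reducer joins the per-filter checks. A toy sketch of the idea (the real work happens in the database query builder; the field names and dict shapes here are invented for illustration):

```python
def matches(variant, filters, operator="AND"):
    """Check a variant against prediction filters joined by AND or OR.

    variant: dict of annotation field -> value
    filters: dict of annotation field -> required value
    """
    checks = [variant.get(field) == value for field, value in filters.items()]
    return any(checks) if operator == "OR" else all(checks)

# Example: damaging per polyphen but benign per clinvar.
variant = {"polyphen": "probably damaging", "clinvar": "benign"}
query = {"polyphen": "probably damaging", "clinvar": "pathogenic"}
```

With OR this variant is returned (one filter matches); with the old AND-only behaviour it would not be.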
Search profiles - New filters
Mon, 23 Jul 2018 03:24:00 +0000
Previously, only gene lists were supported in the Search Profile feature. We've recently added support for storing lists of mutation types (exon, splice, missense, nonsense), ExAC frequency and Gnomad frequency.
Improved structural variation prioritisation
Tue, 26 Jun 2018 04:59:00 +0000
Matt Field has made some significant improvements to the prioritisation of structural variants (SVs), and we've updated our database to reflect those changes, which include: a combined report for both SV callers, prioritisation of SVs where exons are most likely to be impacted, a maximum length filter applied to most SV types, and a flag for whether an event is novel or known. These changes dramatically reduced the number of high priority SVs from >3000 to around 90, with 449 medium priority SVs. Please note that we do not retrospectively re-analyse and update the SV reports for any previous records; this only affects new data.
Handling control samples
Tue, 26 Jun 2018 04:44:00 +0000
We don't necessarily want to see variants from our control samples in the database, but at the same time we still want to be able to download their VCFs and do SNP validation to ensure we don't have sample mix-ups. We've therefore created a separate page listing the control samples and their corresponding VCFs for download.
Automated archiving of BAM files
Tue, 26 Jun 2018 04:41:00 +0000
Our capacity to keep BAM files available for download is a real challenge, and we face constant pressure to free up disk space as more projects come on board. We've come up with a way to automatically archive BAM files older than 1 year to tape storage without any human intervention.
Health reports: GWAS
Tue, 26 Jun 2018 04:20:00 +0000
In addition to Clinvar and Snpedia, we've recently added the GWAS Catalog to the health reports, based on the rsNumbers for a patient. GWAS is particularly useful in a research context: variant frequencies in an affected population are compared against a control (healthy) population using statistical analysis to establish a hypothetical link between variants and disease traits. In the health report under GWAS, we've added the following columns: disease traits, studies, risk allele, initial sample size, replication sample size, p-value and risk allele frequency. GWAS false positives (false associations between variant and disease) are not uncommon due to uncontrolled biases, so it's important to consider whether any replication studies were done to give more confidence to the hypothesised association.
Health reports: Clinvar & Snpedia
Tue, 22 May 2018 04:05:00 +0000
We've added a new feature where users can generate health reports, downloadable in Excel format, from multiple data sources including Clinvar and Snpedia, based on the patient's rsNumbers/variants and genotype. The health reports indicate the patient's risk factors associated with particular diseases/traits. A report can take up to 20 mins to generate, after which an email is sent with the health report attached. Magnitude is a subjective measure of interest ranging from 0-10; the higher the number, the more significant. A magnitude of 2 or higher is probably worth investigating, and a magnitude of 4 or higher is definitely worth investigating. More info at: https://www.snpedia.com/index.php/Magnitude.
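The magnitude thresholds above can be captured in a few lines. This is just a reading of Snpedia's guidance as stated here; the label strings are our own wording, not part of Snpedia.

```python
def magnitude_label(magnitude):
    """Map a Snpedia magnitude (0-10, subjective interest score) to a
    rough triage label using the thresholds described above."""
    if not 0 <= magnitude <= 10:
        raise ValueError("magnitude is defined on the range 0-10")
    if magnitude >= 4:
        return "definitely worth investigating"
    if magnitude >= 2:
        return "probably worth investigating"
    return "low interest"
```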
Excluding variants from search based on Patient study codes
Tue, 17 Apr 2018 01:48:00 +0000
When doing our own variant analysis, we often seek variants that are shared between affected individuals, and we already provide this capability via the 'shared' filter. We recently added a new filter that takes this search one step further by removing variants found in unaffected individuals (usually from the same family). There is a new textbox called 'Exclude variants' where users can add patient study codes; variants found in those individuals are excluded from the variants found in the other individuals, all in a single search operation. Keep in mind that each person carries thousands of variants, so filtering this way can be quite slow if no other filters are applied. We therefore recommend applying as many filters as possible to narrow the search before using this functionality.
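In set terms, the combined search is the intersection of the affected individuals' variants minus the union of the excluded individuals' variants. A toy sketch, assuming simple per-patient variant sets (the real filter runs as a database query):

```python
def shared_minus_excluded(variants_by_patient, affected, excluded):
    """Variants shared by all affected patients, minus any variant seen
    in at least one excluded (unaffected) patient."""
    shared = set.intersection(*(set(variants_by_patient[p]) for p in affected))
    seen_in_excluded = set()
    for p in excluded:
        seen_in_excluded |= set(variants_by_patient[p])
    return shared - seen_in_excluded

# Hypothetical family: P1 and P2 affected, P3 unaffected.
cohort = {"P1": {"v1", "v2", "v3"}, "P2": {"v1", "v2"}, "P3": {"v2"}}
```

This also makes the performance note concrete: with no other filters, each of those sets holds thousands of variants per person.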
GnomAD ethnic frequencies exportable
Tue, 17 Apr 2018 01:15:00 +0000
We've added a new option for users to export gnomAD ethnic frequencies to Excel, including South Asian, East Asian, African American, Jewish, non-Finnish European, Finnish and other minor allele frequencies (MAF). It's optional because we don't actually store the gnomAD frequencies in our database and have to fetch them from elsewhere, making export slower, especially when exporting thousands of variants. It's best to filter as much as you can before enabling this option.
Affected statuses: Database vs Pipeline
Tue, 03 Apr 2018 22:55:00 +0000
We recently introduced a new filter called 'Pipeline affected status'. This is not to be confused with the other 'Affected status' or 'Disease status' filter, which is taken from our Patient Database. The 'Pipeline affected status' differs in that you can reconfigure the pipeline to use a different affected status from what is set in the database, producing different cohort reports. This is useful when the affected status applies to multiple phenotypes or diagnoses and you want to do repeated cohort analyses under different conditions.
Phenotype to Genotype based variant searching
Thu, 15 Feb 2018 04:22:00 +0000
Expand your variant search based on known phenotype-genotype relationships. This filter only works if you have specified patient IDs in the filters. The phenotypes recorded for the specified patients are used to query OMIM for gene relationships, and a new 'Phenotype-Genotype' tab in the results shows the relationships between phenotypes and genes. This only works well for patients that have a good number of phenotypes captured in our databases.
RS number filter
Thu, 15 Feb 2018 04:21:00 +0000
Users can now search by rsNumbers in our search filters.
Variants from cohort reports are now included
Thu, 15 Feb 2018 04:20:00 +0000
Previously, only the variants from the SNV, INDEL and SV reports were included in our database. We've recently rebuilt the database to also include all variants found in the cohort report, even questionable ones of poor quality, because the pipeline's pedigree analysis may have found suggestions of an inheritance pattern for them. This means more variants for you to browse than before.
Gene interactions - Genes don't work in isolation, and your gene lists shouldn't either
Wed, 20 Dec 2017 03:56:00 +0000
Genes don't work in isolation, and your gene lists shouldn't either. Researchers often have a list of known genes to look for when prioritising variants based on the patient's clinical diagnosis. But what should you do if no candidate variants can be found solely based on your gene list? There are many approaches, but one option is to expand the gene list based on known gene interactions and pathways. We rely on the highly curated BioGRID database to expand the gene list to include the network of genes known to interact, either directly or through protein-protein interactions. To use this new feature, tick the new 'Gene interactions' checkbox to expand your gene-based search in this way.
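The expansion itself is a one-hop walk over an interaction network. A minimal sketch, where the toy adjacency dict stands in for BioGRID data (the gene pairs below are illustrative, not a statement about BioGRID's contents):

```python
def expand_gene_list(genes, interactions):
    """Return the input genes plus every gene recorded as interacting
    with at least one of them (one hop only)."""
    expanded = set(genes)
    for gene in genes:
        expanded |= set(interactions.get(gene, ()))
    return expanded

# Stand-in for a BioGRID-derived interaction map.
interactions = {"BRCA1": {"BARD1", "BRCA2"}, "TP53": {"MDM2"}}
```

Genes with no recorded interactions simply pass through unchanged, so ticking the checkbox never shrinks a search.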
Wed, 20 Dec 2017 03:47:00 +0000
Users can now create their own search profiles as a way of storing commonly used search filters, without having to repeatedly choose the same options. One example is to include your gene lists in a search profile. Search profiles are associated with the individual user and are not shared.
BAI - BAM index files downloadable
Wed, 20 Dec 2017 03:45:00 +0000
The BAM index files, known as BAI files, are now available for download along with the BAM file. This is particularly useful when using IGV on your desktop.
New search filters: gnomAD frequency and INDEL ExAC frequencies
Wed, 20 Dec 2017 03:18:00 +0000
The bioinformatics pipeline has been updated to include gnomAD frequencies and INDEL ExAC frequencies. Any new data generated from Dec 2017 onwards will have these new fields; previously analysed datasets will not, and will have to be reanalysed if you want these fields populated. To go along with these new fields, we've added a gnomAD frequency filter to our search page.
Exon coverage search
Tue, 12 Dec 2017 21:41:00 +0000
The sequencing and alignment process isn't perfect, and the pipeline analysis often reveals regions of poor coverage. Previously, we made the coverage reports available for download as part of our datasets as 'exonReports'. We've taken this a step further by allowing users to search these coverage reports by gene, patient ID and coverage type (NO_COVERAGE, POOR_COVERAGE, PARTIAL_COVERAGE). To use this new feature, choose 'Search exon coverage' in the menus. We've also added a new tab displaying exon coverage alongside the variant search results; the tab will only have results if users search by patient ID and gene. This way users can browse variants and coverage results side-by-side, giving a broader view of the quality of the variants being presented. This will be particularly useful for difficult-to-diagnose patients for whom no causal variants have been identified, where potentially disease-causing variants may lie hidden in uncovered regions of the genome.
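The coverage search amounts to filtering report rows on those three criteria. A toy sketch, assuming a flat row format (the field names and sample rows are invented; the real search runs server-side):

```python
COVERAGE_TYPES = {"NO_COVERAGE", "POOR_COVERAGE", "PARTIAL_COVERAGE"}

def search_exon_coverage(rows, patient_id=None, gene=None, coverage_type=None):
    """Return coverage report rows matching every criterion supplied;
    a criterion left as None matches everything."""
    if coverage_type is not None and coverage_type not in COVERAGE_TYPES:
        raise ValueError(f"unknown coverage type: {coverage_type}")
    return [
        r for r in rows
        if (patient_id is None or r["patient_id"] == patient_id)
        and (gene is None or r["gene"] == gene)
        and (coverage_type is None or r["coverage_type"] == coverage_type)
    ]

# Hypothetical rows from an 'exonReports' coverage file.
reports = [
    {"patient_id": "P1", "gene": "BRCA1", "exon": 4, "coverage_type": "NO_COVERAGE"},
    {"patient_id": "P1", "gene": "TP53", "exon": 2, "coverage_type": "POOR_COVERAGE"},
    {"patient_id": "P2", "gene": "BRCA1", "exon": 4, "coverage_type": "PARTIAL_COVERAGE"},
]
```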
CACPIC Frequencies exportable
Tue, 12 Dec 2017 21:31:00 +0000
Chinese frequencies, calculated from our healthy Chinese controls, are now exportable to Excel as an optional column. We've made this optional because the frequencies are calculated at runtime during the export process and can delay its completion. If you're not interested in the CACPIC frequencies, leave the checkbox unticked.
Supplementary information available for download
Tue, 12 Dec 2017 21:27:00 +0000
As part of our datasets for download, we've added some supplementary information (generated as TXT files by the pipeline) to go along with the VCF and BAM files. The files contain information such as the cutoffs used to qualify variants as a PASS, including read depth cutoffs, median quality score cutoffs and so on. They also contain general statistics: total variants passed, the proportion of variants that are exonic or at splice sites, the number of distinct genes, and averages of read depth and median quality. Another file, 'readReport.summary', reports how many reads were paired, mispaired, aligned and unaligned. The supplementary files can be found under the 'Datasets' section.
Measuring Variant Conservation with GERP Score and Siphy
Thu, 07 Dec 2017 22:45:00 +0000
The more annotations, the better! We've recently added 2 new annotations, GERP scores and SiPhy, to assist with variant prioritisation as measures of variant conservation. GERP stands for Genomic Evolutionary Rate Profiling. Conceptually, GERP is a method for identifying slowly evolving regions in a multiple sequence alignment, defined as 'constrained elements': "Constrained elements are identified by comparing the observed to the expected rates of evolution for each window, and defining all those regions whose collective observed rates of evolution are significantly lower than would be expected under a null model." More simply, it is a score of the conservation of each nucleotide in a multi-species alignment, ranging from -12.3 to 6.17, with 6.17 being the most conserved. Positive scores (fewer substitutions observed than expected) indicate that a site is under evolutionary constraint; negative scores may be weak evidence of accelerated evolution. A detailed description can be found in this publication: http://genome.cshlp.org/content/15/7/901.full SiPhy stands for SIte-specific PHYlogenetic analysis and is a conservation score that takes the type of mutation into account. SiPhy scores come from dbNSFP and are on a log odds scale, with most scores ranging between 0 and 20; higher scores indicate higher conservation. More info can be found here: https://academic.oup.com/bioinformatics/article/25/12/i54/187307 These new annotations are displayed under the 'Latest annotations' tab and are loaded dynamically when the web page loads. Because we don't actually store this information, we can't make these annotations filterable on the search page just yet.
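The GERP interpretation described above can be summarised in a small helper. This simply restates the score semantics given here; the label strings and the hard range check are our own choices for the example.

```python
def interpret_gerp(score):
    """Give a rough reading of a GERP score (-12.3 to 6.17):
    positive = constraint, negative = possible accelerated evolution."""
    if not -12.3 <= score <= 6.17:
        raise ValueError("GERP scores range from -12.3 to 6.17")
    if score > 0:
        return "constrained (fewer substitutions observed than expected)"
    if score < 0:
        return "possibly accelerated evolution (weak evidence)"
    return "neutral"
```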
Allele Frequency filtering
Fri, 01 Dec 2017 04:19:00 +0000
The pipeline does the best it can in assigning variants a zygosity based on allele frequencies and counts; usually the cutoff is around 90%. Below this threshold, however, the call becomes less clear. We therefore now allow users to filter by allele frequency, which is particularly useful in cases where zygosity is ambiguous. Users can filter on the VARIABLE allele frequency as well as the REFERENCE allele frequency.
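To make the 90% cutoff concrete, here is a toy zygosity call from read counts. Only the ~90% homozygous cutoff comes from the description above; the heterozygous band is an assumption added for illustration, not the pipeline's actual rule.

```python
def call_zygosity(variant_reads, total_reads, hom_cutoff=0.90):
    """Assign a rough zygosity from the variant allele fraction.
    The 0.30 heterozygous floor is illustrative only."""
    vaf = variant_reads / total_reads
    if vaf >= hom_cutoff:
        return "hom", vaf
    if vaf >= 0.30:  # assumed heterozygous band for this sketch
        return "het", vaf
    return "unclear", vaf
```

Filtering directly on the allele frequency lets you re-examine exactly the calls that fall between these bands.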