Position title: KCCG Genomics Summer Scholarship

Employer: Garvan Institute of Medical Research

Closing date: 30/11/2020

Brief position description: The Garvan Institute of Medical Research brings together world-leading clinicians and basic and translational researchers to break down barriers between traditional scientific disciplines and find solutions to disease. Founded in 1963, Garvan’s mission is to harness all the information encoded in our genome to better diagnose, treat, predict and prevent disease.

Our scientists work across four intersecting research themes: medical genomics, epigenetics, and cellular genomics; diseases of immunity and inflammation; cancer; and diseases of ageing affecting bone, brain and metabolism. In addition, Garvan has three major Centres: the Kinghorn Centre for Clinical Genomics, the Garvan-Weizmann Centre for Cellular Genomics, and the Centre for Population Genomics.

The Kinghorn Centre for Clinical Genomics (KCCG), established by the Garvan Institute, is an Australian research and sequencing centre delivering genomic information for clinical use. Our vision is to translate medical research into clinical care in Australia and beyond by integrating sequencing, bioinformatics and data management in a cutting-edge genomics research environment.

The Opportunities

The KCCG is offering currently enrolled undergraduate students opportunities to carry out projects during summer 2020/2021. These projects provide hands-on research experience in the following topics:

1. Deep Learning Superstardom

This position involves assisting with various projects of the Deep Learning Initiative. Activities include contributing to the development, testing, deployment and documentation of deep neural networks that analyse real-time signals from Oxford Nanopore sequencers, and to the SkyMapper project, which aims to make sense of complex multi-dimensional data using imagification and analysis with convolutional neural networks.

2. Community screening program – Laboratory and Bioinformatics consideration

The project will involve reviewing and optimising specific aspects of the laboratory and downstream bioinformatics challenges as part of TKCC’s community screening program. The program currently utilises whole genome sequencing (WGS) to provide valuable genetic and clinical insights for each participant. The project may also involve evaluating new software or programming packages for specific diseases, e.g. for the analysis of tri-nucleotide repeats and structural rearrangements.

3. Community screening program - community consultation interface

Community consultation is of increasing interest and value to KCCG as part of its clinical translation research. This project will require a student to review existing solutions for gathering community input using online interfaces, consult with stakeholders, and build a proof of concept in line with their expectations. This may include subscriptions to new polls and questions, incorporating media to engage community members, templates for short surveys and open questions with navigable interfaces, and secure storage and analysis of community responses.

The project is expected to involve development of a user-friendly web interface to either SurveyMonkey or REDCap to engage communities, gather data and provide immediate feedback on topics related to the use and reuse of genomic information for research and screening.

4. Community screening program – Dynamic Consent and the GeneTrustee™

A key part of Australia’s first community genomics screening project is linking each participant’s consent with their future lifetime access to their sequenced genome. This involves developing a series of APIs and apps to connect the participant’s identity with their consent, their DNA sample, and the results of its analysis. Many components of this ecosystem have now been developed but are yet to be linked and integrated into a working whole. The project will involve developing proof-of-concept linkages between these ecosystem components, leading to a working prototype of an end-to-end Dynamic Consent and GeneTrustee™ workflow.

5. COVID-19 Detection via Chest Sounds Prototype

COVID-19 leaves telltale marks on the lung tissue and even influences speech. In this project we hypothesise that chest sounds, as detected by stethoscopes, could be a better indicator of lung injury and a COVID-19 signature. The project will develop a proof of concept for this hypothesis on iOS or Android, together with an AI server.

6. Interactive genomes visualization

This project requires strong technical front-end skills and a passion for interactive visualization of complex scientific data. The current generation of web-based UIs for genome visualization is limited by a number of factors, including the performance of REST APIs, the overall performance of back-end systems, and the volume of data and associated network bottlenecks. The goal of the project is to develop a web-based interactive visualization of complex genomics data, using gRPC streams to communicate with the back-end system.

7. Mapping the locations of the control circuits of the human genome

The human genome has been sequenced. To everyone’s surprise, only ~1% of the genome codes for proteins and enzymes, and it is assumed that ‘control circuits’ must occupy a significant part of the remainder. We have recently developed bioinformatic tools that can locate a large proportion of these control circuits. In this summer project, we will commence making these tools accessible to the broader scientific community as a proposed new and valuable scientific resource. We will be approaching the problem from its two extreme ends:

1. Starting at the level of individual genes, we will perform high-resolution mapping of the control circuits of individual genes, to increase our understanding of how these control circuits operate in detail.

2. Starting at the level of the whole genome, we wish to generate a map that locates where in the genome the control circuits are situated.

There are (up to) three student positions available for this project: one student in each of the two sub-projects, and a third position that will ‘float’ between the two projects and build web- and user-interfaces to connect the projects as they each generate data that needs to be fed into the other.

8. Optimisation of popular bioinformatics software for RISC-V architecture

RISC-V is an open-source hardware architecture that is rapidly becoming popular. In the future, such open-source hardware architectures have the potential to be competitive with today’s popular RISC architectures such as ARM. We have recently demonstrated how the ARM architecture can be exploited to design and develop prototypical embedded systems for portable genomic data processing. In this project, you will explore how the emerging RISC-V architecture can be exploited for such a use case.

You will first port existing popular bioinformatics software that currently supports ARM processors (e.g. Minimap2, Samtools, Nanopolish/f5c) to the RISC-V architecture. Then, you will optimise those ported tools to work efficiently on RISC-V. Optionally, based on the candidate’s performance and skillset, there are possibilities to extend the RISC-V architecture with application-specific instructions customised for specific genomic computations.

9. Automated identification of mode of inheritance for inherited disease

The pattern of inheritance of a disease can give us information about the type(s) of genetic mutations carried by patients and their relatives and the risk of further disease. Currently the interpretation of inheritance information is carried out as a manual step by genetic pathologists. This takes experience and time and can miss obscure or unusual inheritance patterns. This project will develop a bioinformatic framework to automate the identification of all possible modes of inheritance from a given family pedigree and highlight additional risk beyond the presenting patient.

The software developed will be part of command-line pipelines and will have a separate web interface. This will form a base for integration into future clinical genetic diagnostic applications under development. Training will include bioinformatics software development and biological aspects of inheritance. Depending on the student’s aptitude and rate of progression, the project will expand further into clinical or research bioinformatics.
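As a rough illustration of what automating this step could look like, the sketch below rules out modes of inheritance that are inconsistent with a small pedigree. It is only a minimal example under strong assumptions (full penetrance, no de novo mutations, three candidate modes); the `Person` class and the `compatible_modes` function are hypothetical names, not part of any existing KCCG software.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Person:
    """One individual in a pedigree (hypothetical structure for illustration)."""
    id: str
    sex: str                      # "M" or "F"
    affected: bool
    father: Optional[str] = None  # id of the father, if known
    mother: Optional[str] = None  # id of the mother, if known


def compatible_modes(pedigree: List[Person]) -> Set[str]:
    """Return the candidate modes not excluded by any parent-child trio.

    Assumes full penetrance and no de novo mutations.
    """
    people: Dict[str, Person] = {p.id: p for p in pedigree}
    modes = {"autosomal dominant", "autosomal recessive", "X-linked recessive"}

    for child in pedigree:
        if child.father is None or child.mother is None:
            continue
        father = people.get(child.father)
        mother = people.get(child.mother)
        if father is None or mother is None:
            continue

        # Dominant: an affected child needs at least one affected parent.
        if child.affected and not (father.affected or mother.affected):
            modes.discard("autosomal dominant")

        # Recessive (autosomal or X-linked): two affected parents
        # cannot have an unaffected child.
        if not child.affected and father.affected and mother.affected:
            modes.discard("autosomal recessive")
            modes.discard("X-linked recessive")

        # X-linked recessive: an affected daughter needs an affected father.
        if child.affected and child.sex == "F" and not father.affected:
            modes.discard("X-linked recessive")

        # X-linked recessive: an affected mother passes the variant to every son.
        if mother.affected and child.sex == "M" and not child.affected:
            modes.discard("X-linked recessive")

    return modes


trio = [
    Person("dad", "M", affected=False),
    Person("mum", "F", affected=False),
    Person("son", "M", affected=True, father="dad", mother="mum"),
]
print(sorted(compatible_modes(trio)))
```

A real framework would also need to handle incomplete penetrance, missing relatives, and further modes such as X-linked dominant or mitochondrial inheritance.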

10. SquiggleKit update and web application extension

The management of raw nanopore sequencing data poses a challenge that must be overcome to facilitate the creation of new bioinformatics algorithms predicated on signal analysis. SquiggleKit is a toolkit for manipulating and interrogating raw nanopore sequencing data that simplifies file handling, data extraction, visualization and signal processing. Since its publication in 2019, many things have changed within the data and processing areas. Two updates to the software are needed to maintain its usability, and a further extension to the visualisation methods will enable future developments and data exploration.

First, you will port some of the scripts from Python 2 to Python 3. Second, you will add some functions to extend the tool’s usability. Third, you will create a web application allowing for interactive navigation and plotting of data.

11. Excel to Database

Many lab groups primarily manage data via spreadsheets and file names, with all the problems that this approach implies. This project would involve creating database-driven apps with a web front end and migrating existing data onto the new platform.

12. Dataset registry and fetch/migration tool

There are many datasets throughout Garvan. These are managed by many different people on many different platforms (gagri, pandora, BaseSpace, Cloudstor, DNANexus, etc). Not only is it difficult to get an overview of what datasets exist, sometimes individual teams lose track of what data they have, particularly as a result of staff and student turnover. Moreover, different platforms use different tools to upload and download data, making it difficult to share data with collaborators or migrate to a more cost-effective platform.

This project involves building a dataset registry to capture key properties of each dataset, such as ownership, accessibility, storage location and a high level description of the kind of data that the dataset contains. The registry can be used for existing datasets as well as new datasets that are created. As an incentive for taking the time to register datasets with the new system, an extension of the project is to build a tool for transferring data between platforms, based on the metadata stored in the registry. This should greatly simplify collaboration, as well as migration, backups, etc.
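To make the idea concrete, a registry entry could be modelled roughly as below. This is only a minimal in-memory sketch: the field names follow the properties listed above, while `DatasetRecord`, `DatasetRegistry` and the example values are hypothetical and do not describe an existing Garvan system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DatasetRecord:
    """Key properties of one dataset (hypothetical schema)."""
    name: str
    owner: str
    platform: str       # e.g. "BaseSpace", "Cloudstor", "DNANexus"
    location: str       # path or URI on that platform
    access: str         # e.g. "restricted", "internal", "public"
    description: str    # high-level description of the data


class DatasetRegistry:
    """In-memory registry keyed by dataset name."""

    def __init__(self) -> None:
        self._records: Dict[str, DatasetRecord] = {}

    def register(self, record: DatasetRecord) -> None:
        # Refuse duplicates so that an existing entry is never silently replaced.
        if record.name in self._records:
            raise ValueError(f"dataset already registered: {record.name}")
        self._records[record.name] = record

    def by_platform(self, platform: str) -> List[DatasetRecord]:
        """List every dataset stored on the given platform."""
        return [r for r in self._records.values() if r.platform == platform]


registry = DatasetRegistry()
registry.register(DatasetRecord(
    name="example-wgs-cohort",            # made-up example values
    owner="example.owner",
    platform="Cloudstor",
    location="/example/path/wgs-cohort",
    access="restricted",
    description="Whole genome sequencing data for an example cohort",
))
print([r.name for r in registry.by_platform("Cloudstor")])
```

A transfer tool could then look up the `platform` and `location` fields of a record to decide which upload or download client to invoke.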

13. Natural Language Processing to Extract Clinical Phenotypes from Biomedical Literature

Clinical Genomics tries to understand how changes in our genome (the 'genotype') lead to clinical abnormalities in a patient (the 'phenotype'). This requires analysing large amounts of genomic and phenotypic data. Unfortunately, phenotype data in the biomedical literature consists mainly of free text and is not standardised: ‘large head’, ‘big head’ and ‘macrocephaly’ all mean the same thing. This makes computational processing of phenotype data very difficult.

In this project we will explore Natural Language Processing (NLP) techniques to extract phenotypic descriptions from the biomedical literature and map them to the Human Phenotype Ontology (HPO), a standard vocabulary of over 10,000 terms used to describe the clinical features of patients with rare diseases. Once encoded in HPO terms, the phenotypic information can be computationally processed, enabling sophisticated applications such as automated diagnosis.
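The simplest possible baseline for this mapping is a synonym lookup, sketched below. It is an illustration only, not the approach the project will take: the `SYNONYMS` table and the `extract_hpo_terms` function are hypothetical, and the identifier shown is believed to be the HPO term for macrocephaly but should be verified against the current HPO release.

```python
import re
from typing import Dict, List

# Hand-written synonym table for illustration only. A real system would
# load labels and synonyms from the HPO release files instead.
SYNONYMS: Dict[str, str] = {
    "large head": "HP:0000256",
    "big head": "HP:0000256",
    "macrocephaly": "HP:0000256",
}


def extract_hpo_terms(text: str) -> List[str]:
    """Return the sorted, de-duplicated HPO IDs whose synonyms occur in text."""
    lowered = text.lower()
    found = set()
    for phrase, hpo_id in SYNONYMS.items():
        # Word boundaries avoid matching inside longer words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(hpo_id)
    return sorted(found)


print(extract_hpo_terms("The proband presented with a big head and seizures."))
```

Exact matching of this kind misses paraphrases, negation ("no macrocephaly") and misspellings, which is where NLP techniques become necessary.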

14. Gaucher disease (GD) is an autosomal recessive disorder caused by inherited deficiency of the enzyme beta-glucocerebrosidase (GBA). Individuals who lack working copies of this gene will develop one of several forms of GD: Type I GD (the most common) affects various organs in the body, but does not affect the nervous system. Other GD types affect the nervous system and are much more serious, some leading to death. A recent development has been the finding that some genetic variants of GBA predispose the individual to develop Parkinson disease later in life. Variants of GBA are much more common in those of Ashkenazi Jewish (AJ) ancestry than in the general population. Paradoxically, although GD is more common in those of AJ ancestry, the incidence of GD is only a fraction of that which should occur based on the prevalence of the GBA variants. In short, something else must be protecting the individual from getting GD, even though they have a faulty GBA gene. We have the whole genome sequence of several thousand non-AJ individuals, and are obtaining the genomic sequence of AJ individuals. This project involves collating these data to understand and identify what might be causing this protective effect. Is it a second gene? Is there some other subtle genetic variant? And might this be of help in understanding not only GD, but also Parkinson disease?

The positions will be offered full-time for 10 weeks and provide an allowance of $5000 as a tax-free scholarship.

How to Apply

All applications must be submitted via the Garvan Careers site. Applications from other sites/channels will not be considered.

Your application should include:

Academic transcripts
Which project(s) you are applying for

Closing Date

The position will remain open until filled. We will be reviewing applications as they are received, and so we encourage you to submit your application as soon as possible.

Job website: http://garvan.wd3.myworkdayjobs.com/en-US/garvan_institute/job/Sydney/KCCG-Genomics-Summer-Scholarship_PRF5656-1

Contact name: Michelle Earle

Contact email: m.earle@garvan.org.au