Identification of recombination hotspots and the associated nucleotide motifs in Bos taurus (cattle)
Project leader: Chad Harland
Project tutors: Lennart Karssen & Anna Torgasheva
Students: Philipp Junk, Debora Garza Hernandez, Dora Sribar, Quirze Castella and Vedran Vuković
Abstract: In this project we attempted to identify nucleotide motifs associated with recombination and with the differing PRDM9 alleles present in the Bos taurus population, using Illumina whole-genome sequence data. A dataset of SNPs and indels for ~450 whole cattle genomes was provided. The 450 individuals formed ~60 three-generation families, each consisting of two parents, a child and between 1 and 11 grand-offspring (children of the child). The dataset was filtered to reduce the estimated false positive rate from ~20% to <5%. The variants were then used in three different methods: two identified recombination events occurring in the child of each trio, using haplotype-based or Mendelian-based phasing (assigning a parent of origin to each variant), and the third identified recombination hotspots from patterns of linkage disequilibrium in the genome. Once such events were identified, the candidate regions were analyzed with the Homer motif identification package to find nucleotide motifs shared among the regions. Homer was also used to generate a pool of motifs common throughout the genome, by randomly selecting a similar number of regions from the genome for analysis. The motifs from the candidate recombination events and hotspots were then to be compared with those from the randomly selected regions via a permutation test, to determine which motifs were overrepresented in the test set. Due to the time constraints of the school, a sufficiently large control set of motifs could not be generated, so motifs overrepresented specifically in recombination regions could not be identified, and we were therefore unable to go on to associate any motifs with PRDM9 alleles.
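As an illustration of the permutation comparison described above, here is a minimal sketch of a one-sided empirical p-value for a single motif. It assumes that occurrence counts for the motif are already available for the candidate regions and for each randomly drawn region set. The function name and all counts are hypothetical, and this is not the project's actual pipeline.

```python
from typing import Sequence


def empirical_p_value(observed: int, null_counts: Sequence[int]) -> float:
    """One-sided empirical p-value with the usual +1 correction.

    observed    -- motif occurrences in the candidate recombination regions
    null_counts -- motif occurrences in each randomly drawn region set
    """
    if not null_counts:
        raise ValueError("need at least one random region set")
    # Count the random sets in which the motif is at least as frequent
    # as in the candidate regions.
    at_least_as_extreme = sum(1 for count in null_counts if count >= observed)
    return (1 + at_least_as_extreme) / (1 + len(null_counts))


# Hypothetical counts for one motif: candidate regions vs. nine random sets.
observed_count = 40
random_counts = [12, 15, 9, 20, 18, 11, 14, 16, 13]
print(empirical_p_value(observed_count, random_counts))
```

With n random region sets, the smallest p-value this estimator can return is 1/(n+1), which is one way to see why a small control set cannot support claims of strong enrichment.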
It would be possible to continue this work by increasing the number of chromosomes analyzed for recombination events and by generating a sufficiently large set of motifs, with their frequencies in the genome, to identify those enriched in recombination events or hotspots.

Hidden liability underlying common complex diseases
Project leader: dr. Julia Dimitrieva
Project tutors: mr. Alexander Kurilshikov, dr. Olga Zaitseva
Students: Maja Fabijanic, German Demidov, Zheng Ning
Pre-requisites for students: Machine learning (tree-based methods, Random Forest, MIC), programming ability in any programming language (R, C++, Java), basic statistics, Linux shell
Outline: The aim of the project was to find biological correlates of Crohn's disease (CD) and ulcerative colitis (UC) using a predictive model based on the IIBDGC (International IBD Genetic Consortium) dataset (23000 GWAS individuals) and the CEDAR (Correlated Expression and Disease Associated Research) gene expression dataset (~350 healthy individuals). We used pre-calculated SNP effects, estimated with BLUP and REML[1] methodology, to predict CD and UC liability in the CEDAR individuals. This methodology allowed us to assign a “disease-like” phenotype to each CEDAR individual, which could then be studied for associations with the CEDAR gene expression data...

Genome-wide association and genomic prediction for human blood lipids using multivariate analysis and machine learning methods
Outline: The aim of this project is to perform a multi-trait genome-wide association study (GWAS) for human blood lipids, in order to boost the power to detect quantitative trait loci and to improve genomic prediction using multiple polygenic scores. The workflow and key results of this project are summarized as follows.
Adjustment of the GLGC meta-GWAS summary statistics, in order to derive a worldwide meta-GWAS that is independent of the three cohorts in our study, namely ORCADES, KORCULA and VIS.
Single-trait GWAS in the three cohorts ORCADES, KORCULA and VIS. In each cohort, we performed a GWAS of each of the four lipid traits: HDL, LDL, triglycerides (TG) and total cholesterol (TC). Each phenotype was adjusted for sex and age and inverse-Gaussian transformed to be standard-normally distributed. Population structure was corrected for using a linear mixed model (procedure R/GenABEL/polygenic). Thereafter, the GRAMMAR+ transformed residuals were tested against each 1000 Genomes imputed variant to obtain effect and standard error estimates (package R/VariABEL) in each cohort...
Pre-requisites for students: Basic knowledge of R is required. Knowledge/understanding of basic machine learning methods will be useful.

Network analysis and prediction
Project leaders: Mar Rodríguez-Girondo, Renaud Tissier
Students: Andrea Gelemanovic, Alexander Grischenko, Olga Sigalova, Kristína Vozáriková
Outline: In this project we investigated several methods for network construction in omic applications and explored different strategies for incorporating grouping information from network analysis into the construction of prediction models. Specifically, we used data from ~700 individuals from the island of Vis, obtained from the Croatian National Biobank (10.001 Dalmatians).
In the first part of the project, we applied three different methods for network construction (through intensity matrix estimation) to the lipidomics and glycomics data: a) weighted correlation network analysis (WGCNA), b) Gaussian graphical modelling, c) Gaussian graphical modelling with ridge penalization. The three methods aim to unravel real direct relations between specific features (lipids or glycans) and to discard indirect relations. We evaluated how well the resulting networks represented a gold-standard grouping based on biological pathway distances, by calculating the network modularity coefficient and assessing its significance with two different permutation procedures. The three methods performed well, and similarly, in separating the different families of lipids, but failed to capture structure in the glycan data.
In the second part of the project, we investigated two strategies for incorporating grouping information into regularized regression models, with body mass index (BMI) as the outcome to predict and the lipidomic data as predictors. Standard ridge and lasso methods were compared with two different strategies for including grouping information from the network analysis. First, we identified modules (groups) of lipids by applying hierarchical clustering to each of the intensity matrices. The first strategy was a variable reduction approach: we selected from one to three “hub” lipids in each group (hubs were defined as the lipids with the strongest connectivity, i.e. the strongest connections to other lipids in the same module) and then applied standard regression techniques. The second strategy was based on penalization that takes the grouping structure into account, using group lasso regression.
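The hub-selection step described above can be illustrated with a minimal sketch. It assumes a symmetric matrix of edge weights (for example, correlations or partial correlations) and a module label per lipid. The function name, the weights and the module labels are hypothetical and are not taken from the project data.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def select_hubs(weights: Sequence[Sequence[float]],
                modules: Sequence[int],
                n_hubs: int = 1) -> Dict[int, List[int]]:
    """Return, per module, the indices of the n_hubs most connected features."""
    members = defaultdict(list)
    for idx, module in enumerate(modules):
        members[module].append(idx)

    hubs = {}
    for module, idxs in members.items():
        # Intramodular connectivity: summed absolute edge weight to the
        # other features assigned to the same module.
        connectivity = {
            i: sum(abs(weights[i][j]) for j in idxs if j != i)
            for i in idxs
        }
        ranked = sorted(idxs, key=lambda i: connectivity[i], reverse=True)
        hubs[module] = ranked[:n_hubs]
    return hubs


# Hypothetical 6-lipid network with two modules of three lipids each.
W = [
    [1.0, 0.8, 0.6, 0.1, 0.0, 0.0],
    [0.8, 1.0, 0.2, 0.0, 0.1, 0.0],
    [0.6, 0.2, 1.0, 0.3, 0.0, 0.0],
    [0.1, 0.0, 0.3, 1.0, 0.5, 0.1],
    [0.0, 0.1, 0.0, 0.5, 1.0, 0.7],
    [0.0, 0.0, 0.0, 0.1, 0.7, 1.0],
]
module_labels = [0, 0, 0, 1, 1, 1]
print(select_hubs(W, module_labels, n_hubs=1))
```

Only the selected hub columns would then be passed to a standard regression model. Connections to lipids outside the module are deliberately ignored when ranking.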
We considered the groups derived from our data-driven strategy (network construction plus hierarchical clustering), and we also evaluated the performance of group lasso with a gold-standard grouping based on a priori biological knowledge. All the prediction models were evaluated in terms of calibration, based on the predictive sum of squares, and all were double cross-validated (the network construction was also included in the cross-validation loop). Our results showed that group lasso based on partial correlation may outperform the group lasso model based on biological knowledge, while the hub selection strategy appears to be misleading. However, we were not able to substantially improve on the standard lasso model in our application. Further research following from these results would compare the proposed methodology in other settings, with omic data of higher dimension and a different correlation structure.

Experimental Design
Leaders: dr. Ivo Ugrina, mr. Yakov Tsepilov
Outline: The aim of this project was to get acquainted with the basics of experimental design, with an emphasis on randomized and block-randomized designs, and to introduce the ideas of technical and biological variation, biological and technical standards, replicates and similar concepts. Special consideration was to be given to the identification of possible sources of variation, to the exploration of the effects of normalization procedures (like quantile normalization or median normalization) and to the introduction of batch correction techniques (like linear mixed effect models and empirical Bayes methods)...
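For readers unfamiliar with quantile normalization, one of the normalization procedures named in the outline, the following is a minimal sketch of the basic idea. It assumes every sample has the same number of features, and it breaks ties by position rather than averaging tied ranks, which fuller implementations usually do. The function name and data are hypothetical.

```python
from typing import List, Sequence


def quantile_normalize(samples: Sequence[Sequence[float]]) -> List[List[float]]:
    """Give every sample the same empirical distribution.

    samples -- one inner sequence of measurements per sample, equal lengths.
    """
    n_features = len(samples[0])
    if any(len(s) != n_features for s in samples):
        raise ValueError("all samples must have the same number of features")

    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_samples = [sorted(s) for s in samples]
    reference = [sum(col) / len(samples) for col in zip(*sorted_samples)]

    normalized = []
    for s in samples:
        # Feature indices ordered from smallest to largest measurement.
        order = sorted(range(n_features), key=lambda i: s[i])
        out = [0.0] * n_features
        for rank, idx in enumerate(order):
            # Replace each value by the reference value of the same rank.
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical samples measured on three features.
data = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(data))
```

After normalization each sample contains the same set of values. Only the assignment of those values to features differs between samples, which removes sample-wide shifts in scale but can also mask genuine global differences.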