Last updated: 2020-07-17

Checks: 6 1

Knit directory: psychencode/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200622) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.

absolute relative
/Users/sabrinami/Github/psychencode .

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version a2d4866. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Untracked files:
    Untracked:  .DS_Store
    Untracked:  models/
    Untracked:  output/.DS_Store
    Untracked:  output/test_results/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/test_alcdep.Rmd) and HTML (docs/test_alcdep.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd cd613c1 sabrina-mi 2020-07-14 don’t know whats going on with clozuk gwas, but pgc gwas works well
Rmd d9957ad sabrina-mi 2020-07-10 GTEx model working
Rmd 643fdaa sabrina-mi 2020-07-08 predixcan tests

Download Data

I used a GWAS of alcohol dependence to test S-PrediXcan on the psychencode model against PredictDB models trained using GTEx v7 data, subsampled to use European samples. The GWAS was performed on unrelated genotyped individuals of European ancestry, with 8,485 cases and 20,272 controls. More info on the study: https://www.biorxiv.org/content/early/2018/03/10/257311. Download the GWAS: https://www.med.unc.edu/pgc/download-results/, filter by Alcohol, the Alcohol Dependence. The GTEx models can be downloaded on predictdb.org or with wget "https://zenodo.org/record/3572799/files/GTEx-V7_HapMap-2017-11-29.tar.gz"

Definitions

conda activate imlabtools
METAXCAN=/Users/sabrinami/Github/MetaXcan/software
MODEL=/Users/sabrinami/Github/psychencode/models
RESULTS=/Users/sabrinami/Github/psychencode/output/test_results
DATA=/Users/sabrinami/Desktop/psychencode_test_data

MetaXcan

Run Predict

printf "Predict expression with Psychencode model\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $MODEL/psychencode_model/psychencode.db \
--vcf_genotypes $DATA/1000G_hg37/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--prediction_output $RESULTS/predixcan/psychencode__predict.txt \
--prediction_summary_output $RESULTS/predixcan/psychencode__summary.txt \
--verbosity 9 \
--throw
printf "Predict expression with Brain Cortex model\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $MODEL/GTEx-V7-en/gtex_v7_Brain_Cortex_imputed_europeans_tw_0.5_signif.db \
--vcf_genotypes $DATA/1000G_hg37/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--prediction_output $RESULTS/predixcan/Brain_Cortex__predict.txt \
--prediction_summary_output $RESULTS/predixcan/Brain_Cortex__summary.txt \
--verbosity 9 \
--throw
printf "Predict expression with Whole Blood model\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $MODEL/GTEx-V7-en/gtex_v7_Whole_Blood_imputed_europeans_tw_0.5_signif.db \
--vcf_genotypes $DATA/1000G_hg37/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--prediction_output $RESULTS/predixcan/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/predixcan/Whole_Blood__summary.txt \
--verbosity 9 \
--throw

Run S-PrediXcan

Run S-PrediXcan on alcohol dependence GWAS with psychencode model, then repeat with GTEx-V7-en Brain Cortex and Whole Blood models.

python3 $METAXCAN/SPrediXcan.py --gwas_file $DATA/pgc_alcdep.aug2018_release/pgc_alcdep.eur_unrel_genotyped.aug2018_release.txt.gz \
--model_db_path $MODEL/psychencode_model/psychencode.db \
--covariance $MODEL/psychencode_model/psychencode.txt.gz \
--output_file $RESULTS/spredixcan/eqtl/pgc_alcdep/pgc_alcdep.eur_psychencode.csv \
--or_column OR \
--pvalue_column P \
--snp_column SNP \
--non_effect_allele_column A2 \
--effect_allele_column A1 \
--separator " "
printf "GTEx Brain Cortex model\n\n"
python3 $METAXCAN/SPrediXcan.py --gwas_file $DATA/pgc_alcdep.aug2018_release/pgc_alcdep.eur_unrel_genotyped.aug2018_release.txt.gz \
--model_db_path $MODEL/GTEx-V7-en/gtex_v7_Brain_Cortex_imputed_europeans_tw_0.5_signif.db \
--covariance $MODEL/GTEx-V7-en/gtex_v7_Brain_Cortex_imputed_eur_covariances.txt.gz \
--overwrite --throw --verbosity 1 \
--output_file $RESULTS/spredixcan/eqtl/pgc_alcdep/pgc_alcdep.eur_Brain_Cortex.csv \
--or_column OR \
--pvalue_column P \
--snp_column SNP \
--non_effect_allele_column A2 \
--effect_allele_column A1 > alcdep_Brain_Cortex_output.txt 2>&1
printf "GTEx Whole Blood model\n\n"
python3 $METAXCAN/SPrediXcan.py --gwas_file $DATA/pgc_alcdep.aug2018_release/pgc_alcdep.eur_unrel_genotyped.aug2018_release.txt.gz \
--model_db_path $MODEL/GTEx-V7-en/gtex_v7_Whole_Blood_imputed_europeans_tw_0.5_signif.db \
--covariance $MODEL/GTEx-V7-en/gtex_v7_Whole_Blood_imputed_eur_covariances.txt.gz \
--output_file $RESULTS/spredixcan/eqtl/pgc_alcdep/pgc_alcdep.eur_Whole_Blood.csv \
--or_column OR \
--pvalue_column P \
--snp_column SNP \
--non_effect_allele_column A2 \
--effect_allele_column A1 \
--separator " "

Compare Results

Definitions

PRE="/Users/sabrinami/Github/psychencode"
DATA="/Users/sabrinami/Desktop/psychencode_test_data"
RESULTS=glue::glue("{PRE}/output/test_results")
CODE=glue::glue("{PRE}/code")
source(glue::glue("{CODE}/load_data_functions.R"))
source(glue::glue("{CODE}/plotting_utils_functions.R"))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(qqman))

gencode_df = load_gencode_df()

Prediction Performance

Load prediction outputs into R.

prediction_fp_Whole_Blood = glue::glue("{RESULTS}/predixcan/Whole_Blood__predict.txt")


## Read the Predict.py output into a dataframe. This function reorganizes the data and adds gene names.
predicted_expression_Whole_Blood = load_predicted_expression(prediction_fp_Whole_Blood, gencode_df)

head(predicted_expression_Whole_Blood)
# A tibble: 6 x 7
  FID    IID    gene_id    predicted_expres… gene_name  chromosome region_id    
  <fct>  <fct>  <chr>                  <dbl> <fct>      <fct>      <fct>        
1 HG000… HG000… ENSG00000…           -1.08   RP11-206L… chr1       region_chr1_…
2 HG000… HG000… ENSG00000…           -0.302  ATAD3C     chr1       region_chr1_…
3 HG000… HG000… ENSG00000…           -0.265  FAM87B     chr1       region_chr1_…
4 HG000… HG000… ENSG00000…           -1.35   LINC00115  chr1       region_chr1_…
5 HG000… HG000… ENSG00000…           -2.67   TAS1R3     chr1       region_chr1_…
6 HG000… HG000… ENSG00000…           -0.0710 SSU72      chr1       region_chr1_…
## read summary of prediction, number of SNPs per gene, cross validated prediction performance
prediction_summary_Whole_Blood = load_prediction_summary(glue::glue("{RESULTS}/predixcan/Whole_Blood__summary.txt"), gencode_df)
## number of genes with a prediction model
dim(prediction_summary_Whole_Blood)
[1] 6297    8
head(prediction_summary_Whole_Blood)
             gene n_snps_in_model n_snps_used pred_perf_r2 pred_perf_pval
1 ENSG00000000457              19          19   0.06784837   1.786471e-06
2 ENSG00000000460              31          31   0.04532282   1.499214e-04
3 ENSG00000000938               9           9   0.01759910   1.954026e-02
4 ENSG00000001084               2          NA   0.02652882   3.875334e-03
5 ENSG00000001167              17          NA   0.03930124   4.607071e-04
6 ENSG00000001460              11          11   0.04711175   1.132466e-04
  gene_name chromosome                       region_id
1     SCYL3       chr1 region_chr1_169117086_170588635
2  C1orf112       chr1 region_chr1_169117086_170588635
3       FGR       chr1   region_chr1_27075376_29689034
4      GCLC       chr6   region_chr6_53414304_55603472
5      NFYA       chr6   region_chr6_40377376_42070983
6     STPG1       chr1   region_chr1_23594100_25190354
prediction_fp_psychencode = glue::glue("{RESULTS}/predixcan/psychencode__predict.txt")
predicted_expression_psychencode = load_predicted_expression(prediction_fp_psychencode, gencode_df)

head(predicted_expression_psychencode)
# A tibble: 6 x 7
  FID    IID    gene_id    predicted_expres… gene_name chromosome region_id     
  <fct>  <fct>  <chr>                  <dbl> <fct>     <fct>      <fct>         
1 HG000… HG000… ENSG00000…          -0.0890  ISG15     chr1       region_chr1_1…
2 HG000… HG000… ENSG00000…          -0.0231  HES4      chr1       region_chr1_1…
3 HG000… HG000… ENSG00000…          -0.0429  ATAD3B    chr1       region_chr1_1…
4 HG000… HG000… ENSG00000…          -0.00230 MXRA8     chr1       region_chr1_1…
5 HG000… HG000… ENSG00000…          -0.0121  PERM1     chr1       region_chr1_1…
6 HG000… HG000… ENSG00000…           0.170   ATAD3C    chr1       region_chr1_1…
prediction_summary_psychencode = load_prediction_summary(glue::glue("{RESULTS}/predixcan/psychencode__summary.txt"), gencode_df)
dim(prediction_summary_psychencode)
[1] 14378     8
head(prediction_summary_psychencode)
             gene n_snps_in_model n_snps_used pred_perf_r2 pred_perf_pval
1 ENSG00000000457              26          26           NA             NA
2 ENSG00000000460              48          48           NA             NA
3 ENSG00000000971              21          19           NA             NA
4 ENSG00000001036              14          NA           NA             NA
5 ENSG00000001084              41          NA           NA             NA
6 ENSG00000001167               9          NA           NA             NA
  gene_name chromosome                       region_id
1     SCYL3       chr1 region_chr1_169117086_170588635
2  C1orf112       chr1 region_chr1_169117086_170588635
3       CFH       chr1 region_chr1_196207071_197342384
4     FUCA2       chr6 region_chr6_141967342_144998674
5      GCLC       chr6   region_chr6_53414304_55603472
6      NFYA       chr6   region_chr6_40377376_42070983
prediction_fp_Brain_Cortex = glue::glue("{RESULTS}/predixcan/Brain_Cortex__predict.txt")

predicted_expression_Brain_Cortex = load_predicted_expression(prediction_fp_Brain_Cortex, gencode_df)

head(predicted_expression_Brain_Cortex)
# A tibble: 6 x 7
  FID    IID    gene_id    predicted_expres… gene_name chromosome region_id     
  <fct>  <fct>  <chr>                  <dbl> <fct>     <fct>      <fct>         
1 HG000… HG000… ENSG00000…            0.0873 TNFRSF4   chr1       region_chr1_1…
2 HG000… HG000… ENSG00000…           -0.883  LINC00115 chr1       region_chr1_1…
3 HG000… HG000… ENSG00000…            0.0372 SLC35E2   chr1       region_chr1_1…
4 HG000… HG000… ENSG00000…            0.773  RP1-283E… chr1       region_chr1_1…
5 HG000… HG000… ENSG00000…            0.487  NDUFB4P8  chr1       region_chr1_1…
6 HG000… HG000… ENSG00000…            1.89   MXRA8     chr1       region_chr1_1…
prediction_summary_Brain_Cortex = load_prediction_summary(glue::glue("{RESULTS}/predixcan/Brain_Cortex__summary.txt"), gencode_df)
dim(prediction_summary_Brain_Cortex)
[1] 4332    8
head(prediction_summary_Brain_Cortex)
             gene n_snps_in_model n_snps_used pred_perf_r2 pred_perf_pval
1 ENSG00000000419             115          NA   0.05285449   1.509551e-02
2 ENSG00000001084              10          NA   0.11134301   3.222470e-04
3 ENSG00000002016              21          NA   0.26224973   1.146342e-09
4 ENSG00000002549               7          NA   0.07884824   2.493767e-03
5 ENSG00000002933              40          NA   0.12714419   4.978505e-05
6 ENSG00000003056              26          NA   0.04939641   1.950023e-02
  gene_name chromosome                       region_id
1      DPM1      chr20  region_chr20_50623121_53856010
2      GCLC       chr6   region_chr6_53414304_55603472
3     RAD52      chr12       region_chr12_44631_971165
4      LAP3       chr4   region_chr4_17381699_18840251
5  TMEM176A       chr7 region_chr7_150143569_151014417
6      M6PR      chr12   region_chr12_8878799_10328935

Load observed expression.

observed_expression = read.csv(glue::glue("{DATA}/GEUVADIS.observed_df.csv"))

Merge predicted expression with observed expression, and calculate correlation between predicted and observed expression for each model.

merged_Whole_Blood = merge(predicted_expression_Whole_Blood, observed_expression, by = c("gene_id", "IID"))
# calculate correlation between predicted and observed expression for each gene
correlation_Whole_Blood = merged_Whole_Blood %>% group_by(gene_id) %>% summarise(corr = cor(predicted_expression, observed_expression, method = "spearman"))
summary(correlation_Whole_Blood)
   gene_id               corr          
 Length:560         Min.   :-0.495534  
 Class :character   1st Qu.:-0.008388  
 Mode  :character   Median : 0.059699  
                    Mean   : 0.076672  
                    3rd Qu.: 0.137423  
                    Max.   : 0.831743  
merged_psychencode = merge(predicted_expression_psychencode, observed_expression, by = c("gene_id", "IID"))
# calculate correlation between predicted and observed expression for each gene
correlation_psychencode = merged_psychencode %>% group_by(gene_id) %>% summarise(corr = cor(predicted_expression, observed_expression, method = "spearman"))
summary(correlation_psychencode)
   gene_id               corr         
 Length:1151        Min.   :-0.55006  
 Class :character   1st Qu.:-0.02436  
 Mode  :character   Median : 0.02961  
                    Mean   : 0.04690  
                    3rd Qu.: 0.09639  
                    Max.   : 0.86425  
merged_Brain_Cortex = merge(predicted_expression_Brain_Cortex, observed_expression, by = c("gene_id", "IID"))
# calculate correlation between predicted and observed expression for each gene
correlation_Brain_Cortex = merged_Brain_Cortex %>% group_by(gene_id) %>% summarise(corr = cor(predicted_expression, observed_expression, method = "spearman"))
summary(correlation_Brain_Cortex)
   gene_id               corr         
 Length:334         Min.   :-0.53217  
 Class :character   1st Qu.:-0.02136  
 Mode  :character   Median : 0.03431  
                    Mean   : 0.06777  
                    3rd Qu.: 0.12802  
                    Max.   : 0.84747  

S-PrediXcan Results

Load S-PrediXcan association output into R, and check for significant genes.

spredixcan_association_psychencode = load_spredixcan_association(glue::glue("{RESULTS}/spredixcan/eqtl/pgc_alcdep/pgc_alcdep.eur_psychencode.csv"), gencode_df)
dim(spredixcan_association_psychencode)
[1] 14348    14
spredixcan_association_psychencode %>% filter(pvalue < 0.05/nrow(spredixcan_association_psychencode))
 [1] gene            zscore          effect_size     pvalue         
 [5] var_g           pred_perf_r2    pred_perf_pval  pred_perf_qval 
 [9] n_snps_used     n_snps_in_cov   n_snps_in_model gene_name      
[13] chromosome      region_id      
<0 rows> (or 0-length row.names)
spredixcan_association_Brain_Cortex = load_spredixcan_association(glue::glue("{RESULTS}/spredixcan/eqtl/pgc_alcdep/pgc_alcdep.eur_Brain_Cortex.csv"), gencode_df)
dim(spredixcan_association_Brain_Cortex)
[1] 4323   14
spredixcan_association_Brain_Cortex %>% filter(pvalue < 0.05/nrow(spredixcan_association_Brain_Cortex))
 [1] gene            zscore          effect_size     pvalue         
 [5] var_g           pred_perf_r2    pred_perf_pval  pred_perf_qval 
 [9] n_snps_used     n_snps_in_cov   n_snps_in_model gene_name      
[13] chromosome      region_id      
<0 rows> (or 0-length row.names)
spredixcan_association_Brain_Cortex %>% arrange(pvalue) %>% select(gene, pvalue) %>% head
             gene       pvalue
1 ENSG00000230528 0.0006159278
2 ENSG00000153291 0.0009708806
3 ENSG00000100395 0.0018640389
4 ENSG00000158270 0.0027746399
5 ENSG00000215504 0.0030597439
6 ENSG00000133107 0.0031635336
spredixcan_association_Whole_Blood = load_spredixcan_association(glue::glue("{RESULTS}/spredixcan/eqtl/pgc_alcdep/pgc_alcdep.eur_Whole_Blood.csv"), gencode_df)
dim(spredixcan_association_Whole_Blood)
[1] 6289   14
spredixcan_association_Whole_Blood %>% filter(pvalue < 0.05/nrow(spredixcan_association_Whole_Blood))
 [1] gene            zscore          effect_size     pvalue         
 [5] var_g           pred_perf_r2    pred_perf_pval  pred_perf_qval 
 [9] n_snps_used     n_snps_in_cov   n_snps_in_model gene_name      
[13] chromosome      region_id      
<0 rows> (or 0-length row.names)
spredixcan_association_Whole_Blood %>% arrange(pvalue) %>% select(gene, pvalue) %>% head
             gene       pvalue
1 ENSG00000062282 5.877502e-05
2 ENSG00000230207 4.113094e-04
3 ENSG00000115758 5.272995e-04
4 ENSG00000070087 7.088832e-04
5 ENSG00000244691 7.224511e-04
6 ENSG00000111886 7.780464e-04

Plot S-PrediXcan Association

Manhattan plot and qq plot for alcohol dependence GWAS. There’s only one or two significant peaks.

alcdep_GWAS = read.table(glue::glue("{DATA}/pgc_alcdep.aug2018_release/pgc_alcdep.eur_unrel_genotyped.aug2018_release.txt.gz"), header=TRUE, sep=" ")
manhattan(alcdep_GWAS, chr="CHR", bp="BP", snp="SNP", p="P" )
gg_qqplot(alcdep_GWAS$P)

Plot p-values of association output.

spredixcan_association_Brain_Cortex %>% arrange(pvalue) %>% ggplot(aes(pvalue)) + geom_histogram(bins=20)

gg_qqplot(spredixcan_association_Brain_Cortex$pvalue)

spredixcan_association_psychencode %>% arrange(pvalue) %>% ggplot(aes(pvalue)) + geom_histogram(bins=20)
Warning: Removed 32 rows containing non-finite values (stat_bin).

gg_qqplot(spredixcan_association_psychencode$pvalue)

Brain_Cortex_psychencode_association = inner_join(spredixcan_association_Brain_Cortex, spredixcan_association_psychencode, by=c("gene"))
dim(Brain_Cortex_psychencode_association)
[1] 3414   27
Brain_Cortex_psychencode_association %>% ggplot(aes(zscore.x, zscore.y)) + geom_point()
Warning: Removed 6 rows containing missing values (geom_point).

spredixcan_association_Whole_Blood %>% arrange(pvalue) %>% ggplot(aes(pvalue)) + geom_histogram(bins=20)

gg_qqplot(spredixcan_association_Whole_Blood$pvalue)


sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] qqman_0.1.4     forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3    
 [5] purrr_0.3.3     readr_1.3.1     tidyr_1.0.0     tibble_2.1.3   
 [9] ggplot2_3.3.0   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3       lubridate_1.7.4  lattice_0.20-38  assertthat_0.2.1
 [5] zeallot_0.1.0    rprojroot_1.3-2  digest_0.6.23    utf8_1.1.4      
 [9] R6_2.4.1         cellranger_1.1.0 backports_1.1.5  reprex_0.3.0    
[13] evaluate_0.14    httr_1.4.1       highr_0.8        pillar_1.4.3    
[17] rlang_0.4.2      readxl_1.3.1     rstudioapi_0.10  whisker_0.4     
[21] rmarkdown_2.1    labeling_0.3     munsell_0.5.0    broom_0.5.3     
[25] compiler_3.6.2   httpuv_1.5.3.1   modelr_0.1.5     xfun_0.12       
[29] pkgconfig_2.0.3  htmltools_0.4.0  tidyselect_0.2.5 workflowr_1.6.2 
[33] fansi_0.4.1      calibrate_1.7.7  crayon_1.3.4     dbplyr_1.4.2    
[37] withr_2.1.2      later_1.0.0      MASS_7.3-51.4    grid_3.6.2      
[41] nlme_3.1-142     jsonlite_1.6     gtable_0.3.0     lifecycle_0.1.0 
[45] DBI_1.1.0        git2r_0.27.1     magrittr_1.5     scales_1.1.0    
[49] cli_2.0.1        stringi_1.4.5    farver_2.0.3     fs_1.3.1        
[53] promises_1.1.0   xml2_1.2.2       generics_0.0.2   vctrs_0.2.1     
[57] tools_3.6.2      glue_1.3.1       hms_0.5.3        yaml_2.2.0      
[61] colorspace_1.4-1 rvest_0.3.5      knitr_1.27       haven_2.2.0