1. Gathering data for processing

1.1. Names of humnan membrane proteins and their genes

Downlooad the protein lists from UniProt, using the following query: annotation:(type:transmem) AND organism:"Homo sapiens (Human) [9606]. Alternatively use this link to open the search results page.

Ensure to click Columns button to add Topological domain into the columns to be displayed, because this column will be needed for analyzing immunohistochemistry data. Click Download to save the results, choosing the format Tab-separated. The approximate downloaded file size is 1.4 MB. Unzip the file and rename it to Membrane-UniProt3.tab for import.

1.2. Data from HPA

Download immunohistochemistry (IHC) values for all normal tissues from The Human Protein Atlas Consortium, or directly from this link. Similarly, download the staining profiles for proteins in human tumor (including colorectal cancer) tissue based on immunohistochemisty from link. Unzip both files.

1.3. Data from EVPedia

Check the EVpedia site and download all proteins in the CVS format. Unzip and rename the file as EVpedia.cvs. Or use the pre-downloaded file.

1.4 Data load into R
rm(list=ls(all=TRUE)) # Clean all variables

cancerName = '^colorectal' # Cancer type
setwd("/Volumes/Work/Markerwork/") # set the working directory)
uniprot = read.delim('Membrane-UniProt3.tab', sep='\t', header = TRUE, stringsAsFactors = FALSE)
normalTissueIHC = read.delim('normal_tissue.tsv', sep='\t', header = TRUE, stringsAsFactors = FALSE)
tummorTissueIHC = read.delim('pathology.tsv', sep='\t', header = TRUE, stringsAsFactors = FALSE)
# Choose only the target cancer type
cancerTissueIHC = tummorTissueIHC[grep(cancerName, tummorTissueIHC$Cancer), ] 
evpedia = read.delim('evpedia.csv', sep=',', header = TRUE, stringsAsFactors = FALSE)
genesInEvpedia = unique(evpedia$UniProt.name)

2. Data processing

2.1. Extract genes for processing

We extract unique gene names associated with membrane proteins from UniProt. The total number of genes is checked.

membraneGenes = unique(unlist(strsplit(uniprot$Gene.names, split = ' ')))
length(membraneGenes)
[1] 6566

Proteins with extracellular domain were further selected from all gene names for both transcriptome and immunohistochemical analyses based on the description of the category Topological.domain from Uniprot database. First, we select the proteins from the uniprot data if they have Extracellular in the value of the Topological.domain column, then get their gene names (including synonyms).

proteinsWithExtracelluarTopoDomain = uniprot[grep('Extracellular.', uniprot$Topological.domain), ]
genesForExtracelluarProteins = unique(unlist(strsplit(proteinsWithExtracelluarTopoDomain$Gene.names, split = ' ')))
length(genesForExtracelluarProteins)
[1] 2913
2.2. Cancer data

Cancer data are processed to obtain the mean staining score. Numerical values are assigned to staining levels,[High Mediuem Low Not.detected] = [3 2 1 0]; this assignment can be modified.

scores = diag(c(3, 2, 1, 0))
cancerTissueIHC$meanScores = rowMeans(as.matrix(cancerTissueIHC[, c("High", "Medium", "Low", "Not.detected")]) %*% scores)

By inspecting the number of unique gene names, we found the length of unique gene names is shorter than the score list.

cancerTissueIHC[duplicated(cancerTissueIHC$Gene.name),]$Gene.name # 13 duplicated gene names
 [1] "SDHD"     "ARL14EPL" "COG8"     "C2orf61"  "MATR3"    "HIST1H3D" "TXNRD3NB" "PRSS50"   "LYNX1"    "SCO2"     "BTBD8"    "RABGEF1"  "LYNX1"   

In addition to 13 records with duplicated gene names (with identical or different mean IHC scores), there are also 4292 NA values. These two data quality issues can be resolved by aggreating the mean scores by gene names while removing NA values (default option) to get the sanitized mean IHC scores for each gene for colorectal cancers.

sanitizedcancerTissueIHC = aggregate(meanScores ~ Gene.name, cancerTissueIHC, mean)
genesSelectedFromCancer = intersect(genesForExtracelluarProteins, sanitizedcancerTissueIHC$Gene.name)
genesSelectedForAnalysis = intersect(genesSelectedFromCancer, genesInEvpedia) 
## genesSelectedForAnalysis = genesSelectedFromCancer
2.3. Normal tissue data

We perfome the similar processing with the normal tissue data. As the dataset is large, we limit genes used in the tumor.

genesInNormalTissueIHC = unique(normalTissueIHC$Gene.name)
genesSelectedForAnalysis = intersect(genesSelectedForAnalysis, genesInNormalTissueIHC)
selectedNormalTissueIHCRecords = normalTissueIHC[normalTissueIHC$Gene.name %in% genesSelectedForAnalysis,]

Normal tissue data have multiple IHC scores for a given gene, as the gene is present in different types of tissue. We rearrange normal tissue data to count the occurence of the IHC score for each gene.

selectedNormalTissueIHC = table(selectedNormalTissueIHCRecords$Gene.name, selectedNormalTissueIHCRecords$Level)
selectedNormalTissueIHC.meanScores = rowMeans(as.matrix(selectedNormalTissueIHC[, c("High", "Medium", "Low", "Not detected")]) %*% scores)

3. Plotting data

3.1. Linear data scaling

This routine uses a regular nomralization [min max] = [0 1].

library(scales)
scaledTumorTissueIHCScores = rescale(sanitizedcancerTissueIHC$meanScores)
names(scaledTumorTissueIHCScores) = sanitizedcancerTissueIHC$Gene.name
scaledSelectedTumorIHCScores = scaledTumorTissueIHCScores[genesSelectedForAnalysis]
scaledNormalTissueIHCScores = rescale(selectedNormalTissueIHC.meanScores)
scaledNormalTissueIHCScores = scaledNormalTissueIHCScores[genesSelectedForAnalysis]

Plotting data:

library(ggplot2)
library(ggrepel)
library(scico)

d = data.frame(x = scaledSelectedTumorIHCScores, y = scaledNormalTissueIHCScores, z = scaledSelectedTumorIHCScores - scaledNormalTissueIHCScores)
colormap = scico(256, palette = "vik", direction=1)

ggplot(d, aes(x, y)) + geom_point(aes(color = z)) + 
  labs(x = "Staining score in CRC tissue (a.u.)", y = "Staining score in normal tissue (a.u.)") + 
  labs(colour = "Difference\n[tumor - normal]") + 
  scale_colour_gradientn(limits=c(-1, 1), colours = colormap) + 
  geom_text_repel(aes(label=ifelse(z > 1.5, genesSelectedForAnalysis,'')), size = 3) +
  theme(axis.title = element_text(size = 15), axis.text = element_text(size = 14)) +
  coord_fixed(1)

3.2 Z-score

This routine scales the data to normal distribution.

mu = mean(sanitizedcancerTissueIHC$meanScores)
sigma = sd(sanitizedcancerTissueIHC$meanScores)
scaledTumorTissueZScores = (sanitizedcancerTissueIHC$meanScores - mu)/sigma 
names(scaledTumorTissueZScores) = sanitizedcancerTissueIHC$Gene.name
scaledSelectedTumorZScores = scaledTumorTissueZScores[genesSelectedForAnalysis]

mu = mean(selectedNormalTissueIHC.meanScores)
sigma = sd(selectedNormalTissueIHC.meanScores)

scaledNormalTissueZScores = (selectedNormalTissueIHC.meanScores - mu)/sigma
scaledNormalTissueZScores = scaledNormalTissueZScores[genesSelectedForAnalysis]

Plotting data:

dz = data.frame(x = scaledSelectedTumorZScores, y = scaledNormalTissueZScores, z = scaledSelectedTumorZScores - scaledNormalTissueZScores)

ggplot(dz, aes(x, y)) + geom_point(aes(color = z)) + 
  labs(x = "Z score in CRC tissue (a.u.)", y = "Z score in normal tissue (a.u.)") + 
  labs(colour = "Difference\n[tumor - normal]") + 
  xlim(-1.5, 2.5) + ylim(-1.5, 2.5) +
  scale_colour_gradientn(limits=c(-3, 3), colours = colormap) + 
  geom_text_repel(aes(label=ifelse(z > 1.5,genesSelectedForAnalysis,'')), size = 3) + 
  theme(axis.title = element_text(size = 15), axis.text = element_text(size = 14))+
  coord_fixed(1)

---
title: "CRC marker selection"
output: html_notebook
---
#### 1. Gathering data for processing
##### 1.1. Names of humnan membrane proteins and their genes

Downlooad the protein lists from UniProt, using the following query: 
`annotation:(type:transmem) AND organism:"Homo sapiens (Human) [9606]`. Alternatively use this [link](https://www.uniprot.org/uniprot/?query=annotation%3A%28type%3Atransmem%29+organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22&sort=score) to open the search results page. 

Ensure to click `Columns` button to add `Topological domain` into the columns to be displayed, because this column will be needed for analyzing immunohistochemistry data. Click `Download` to save the results, choosing the format `Tab-separated`. The approximate downloaded file size is 1.4 MB. Unzip the file and rename it to `Membrane-UniProt3.tab` for import.

##### 1.2. Data from HPA
Download immunohistochemistry (IHC) values for all normal tissues from The Human Protein Atlas Consortium, or directly from this [link](https://www.proteinatlas.org/download/normal_tissue.tsv.zip). Similarly, download the staining profiles for proteins in human tumor (including colorectal cancer) tissue based on immunohistochemisty from [link](https://www.proteinatlas.org/download/pathology.tsv.zip). Unzip both files. 

##### 1.3. Data from EVPedia
Check the [EVpedia site](http://student4.postech.ac.kr/evpedia2_xe/xe/index.php?mid=Top_protein_human) and download all proteins in the CVS format. Unzip and rename the file as `EVpedia.cvs`. Or use the pre-downloaded file.

##### 1.4 Data load into R
```{r message=FALSE, warning=FALSE, paged.print=FALSE}
rm(list=ls(all=TRUE)) # Clean all variables

cancerName = '^colorectal' # Cancer type
#setwd("/Volumes/Work/Markerwork/") # set the working directory
setwd("/Volumes/Work/Redhawk/Dropbox (Personal)/Current Work/Papers/2018 CRC exosome/08 Final submission/01 Submission file/Markerselection")
uniprot = read.delim('Membrane-UniProt3.tab', sep='\t', header = TRUE, stringsAsFactors = FALSE)
normalTissueIHC = read.delim('normal_tissue.tsv', sep='\t', header = TRUE, stringsAsFactors = FALSE)
tummorTissueIHC = read.delim('pathology.tsv', sep='\t', header = TRUE, stringsAsFactors = FALSE)
# Choose only the target cancer type
cancerTissueIHC = tummorTissueIHC[grep(cancerName, tummorTissueIHC$Cancer), ] 
evpedia = read.delim('evpedia.csv', sep=',', header = TRUE, stringsAsFactors = FALSE)
genesInEvpedia = unique(evpedia$UniProt.name)
```

#### 2. Data processing 
##### 2.1. Extract genes for processing 
We extract unique gene names associated with membrane proteins from UniProt. The total number of genes is checked.
```{r message=FALSE, warning=FALSE}
membraneGenes = unique(unlist(strsplit(uniprot$Gene.names, split = ' ')))
length(membraneGenes)
```
Proteins with extracellular domain were further selected from all gene names for both transcriptome and immunohistochemical analyses based on the description of the category `Topological.domain` from Uniprot database. First, we select the proteins from the uniprot data if they have `Extracellular` in the value of the `Topological.domain` column, then get their gene names (including synonyms).

```{r}
proteinsWithExtracelluarTopoDomain = uniprot[grep('Extracellular.', uniprot$Topological.domain), ]
genesForExtracelluarProteins = unique(unlist(strsplit(proteinsWithExtracelluarTopoDomain$Gene.names, split = ' ')))
length(genesForExtracelluarProteins)
```
##### 2.2. Cancer data
Cancer data are processed to obtain the mean staining score. Numerical values are assigned to staining levels,`[High Mediuem Low Not.detected] = [3 2 1 0];` this assignment can be modified. 

```{r message=FALSE, warning=FALSE}
scores = diag(c(3, 2, 1, 0))
cancerTissueIHC$meanScores = rowMeans(as.matrix(cancerTissueIHC[, c("High", "Medium", "Low", "Not.detected")]) %*% scores)
```

By inspecting the number of unique gene names, we found the length of unique gene names is shorter than the score list.

```{r message=FALSE, warning=FALSE}
cancerTissueIHC[duplicated(cancerTissueIHC$Gene.name),]$Gene.name # 13 duplicated gene names
```
In addition to 13 records with duplicated gene names (with identical or different mean IHC scores), there are also 4292 NA values. These two data quality issues can be resolved by aggreating the mean scores by gene names while removing NA values (default option) to get the sanitized mean IHC scores for each gene for colorectal cancers.
```{r message=FALSE, warning=FALSE}
sanitizedcancerTissueIHC = aggregate(meanScores ~ Gene.name, cancerTissueIHC, mean)
genesSelectedFromCancer = intersect(genesForExtracelluarProteins, sanitizedcancerTissueIHC$Gene.name)
genesSelectedForAnalysis = intersect(genesSelectedFromCancer, genesInEvpedia) 
## genesSelectedForAnalysis = genesSelectedFromCancer
```
##### 2.3. Normal tissue data
We perfome the similar processing with the normal tissue data. As the dataset is large, we limit genes used in the tumor. 

```{r}
genesInNormalTissueIHC = unique(normalTissueIHC$Gene.name)
genesSelectedForAnalysis = intersect(genesSelectedForAnalysis, genesInNormalTissueIHC)
selectedNormalTissueIHCRecords = normalTissueIHC[normalTissueIHC$Gene.name %in% genesSelectedForAnalysis,]
```
Normal tissue data have multiple IHC scores for a given gene, as the gene is present in different types of tissue. We rearrange normal tissue data to count the occurence of the IHC score for each gene.

```{r}
selectedNormalTissueIHC = table(selectedNormalTissueIHCRecords$Gene.name, selectedNormalTissueIHCRecords$Level)
selectedNormalTissueIHC.meanScores = rowMeans(as.matrix(selectedNormalTissueIHC[, c("High", "Medium", "Low", "Not detected")]) %*% scores)
```

#### 3. Plotting data
##### 3.1. Linear data scaling
This routine uses a regular nomralization `[min max] =  [0 1]`. 
```{r message=FALSE, warning=FALSE}
library(scales)
scaledTumorTissueIHCScores = rescale(sanitizedcancerTissueIHC$meanScores)
names(scaledTumorTissueIHCScores) = sanitizedcancerTissueIHC$Gene.name
scaledSelectedTumorIHCScores = scaledTumorTissueIHCScores[genesSelectedForAnalysis]
scaledNormalTissueIHCScores = rescale(selectedNormalTissueIHC.meanScores)
scaledNormalTissueIHCScores = scaledNormalTissueIHCScores[genesSelectedForAnalysis]
```
Plotting data:
```{r message=FALSE, warning=FALSE}
library(ggplot2)
library(ggrepel)
library(scico)

d = data.frame(x = scaledSelectedTumorIHCScores, y = scaledNormalTissueIHCScores, z = scaledSelectedTumorIHCScores - scaledNormalTissueIHCScores)
colormap = scico(256, palette = "vik", direction=1)

ggplot(d, aes(x, y)) + geom_point(aes(color = z)) + 
  labs(x = "Staining score in CRC tissue (a.u.)", y = "Staining score in normal tissue (a.u.)") + 
  labs(colour = "Difference\n[tumor - normal]") + 
  scale_colour_gradientn(limits=c(-1, 1), colours = colormap) + 
  geom_text_repel(aes(label=ifelse(z > 1.5, genesSelectedForAnalysis,'')), size = 3) +
  theme(axis.title = element_text(size = 15), axis.text = element_text(size = 14)) +
  coord_fixed(1)
```
##### 3.2 Z-score
This routine scales the data to normal distribution.
```{r message=FALSE, warning=FALSE}
mu = mean(sanitizedcancerTissueIHC$meanScores)
sigma = sd(sanitizedcancerTissueIHC$meanScores)
scaledTumorTissueZScores = (sanitizedcancerTissueIHC$meanScores - mu)/sigma 
names(scaledTumorTissueZScores) = sanitizedcancerTissueIHC$Gene.name
scaledSelectedTumorZScores = scaledTumorTissueZScores[genesSelectedForAnalysis]

mu = mean(selectedNormalTissueIHC.meanScores)
sigma = sd(selectedNormalTissueIHC.meanScores)

scaledNormalTissueZScores = (selectedNormalTissueIHC.meanScores - mu)/sigma
scaledNormalTissueZScores = scaledNormalTissueZScores[genesSelectedForAnalysis]
```
Plotting data:
```{r message=FALSE, warning=FALSE}
dz = data.frame(x = scaledSelectedTumorZScores, y = scaledNormalTissueZScores, z = scaledSelectedTumorZScores - scaledNormalTissueZScores)

ggplot(dz, aes(x, y)) + geom_point(aes(color = z)) + 
  labs(x = "Z score in CRC tissue (a.u.)", y = "Z score in normal tissue (a.u.)") + 
  labs(colour = "Difference\n[tumor - normal]") + 
  xlim(-1.5, 2.5) + ylim(-1.5, 2.5) +
  scale_colour_gradientn(limits=c(-3, 3), colours = colormap) + 
  geom_text_repel(aes(label=ifelse(z > 1.5,genesSelectedForAnalysis,'')), size = 3) + 
  theme(axis.title = element_text(size = 15), axis.text = element_text(size = 14))+
  coord_fixed(1)
```