Extracting Demographic and Diagnostic Sample Attributes from NCI’s Genomic Data Commons
Michael Kesling 7/25/2019
Introduction
In order to correctly select the pertinent RNA-Seq samples from The Cancer Genome Atlas (TCGA) project, it’s necessary to access attribute data about the samples from the NCI’s Genomic Data Commons (GDC), where the data are now warehoused (https://gdc.cancer.gov).
A Bioconductor package named ‘GenomicDataCommons’ is utilized here in order to explore and then extract the relevant fields of sample attributes. The ‘BiocManager’ package must first be installed in order to install bioconductor packages.
Note:
Some of the code is not shown on the GitHub website. One needs to look at the R Markdown file to view all of the code.
Understanding TCGA Identifiers
Each RNA-Seq file that was generated by TCGA, whether it was obtained directly through GDC, from the Wang publication of cross-study normalized data, or other studies, is identified through its TCGA identifier. This identifier gives important information regarding the type of sample used, and is essential for mapping attribute data on GDC to other published studies.
We’ll look at the structure of the TCGA identifier to determine what information we need to uniquely identify a human patient.
Label | Identifier_for | Value | Value_Description | Possible_Values |
---|---|---|---|---|
TSS | Tissue Source Site | 2 | GBM (brain tumor) sample from MD Anderson | see https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tissue-source-site-codes for complete list |
Participant | Study Participant (Patient) | 14 | The 14th participant from the GBM Study of MD Anderson | Any alph-numeric value |
Sample | Sample Type | 1 | A solid tumor | Tumor types range from 01-09, normal types from 10-19, and control samples from 20-29. See https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes |
While other fields in the TCGA identifier may be important, the “TSS”, “Participant”, and “Sample” fields are sufficient for identifying the desired sample attributes of individual patients on GDC.
Exploring GDC Sample Attributes
We using the Vignette for the GenomicDataCommons package to explore and query the GDC data.
As the sample attributes of Demographics (characteristics of the human patient such as age, sex, race, etc.) and Diagnoses (characteristics of the tumor/healthy sample such as stage of cancer, new diagnosis/recurrence, etc.) all fall under the umbrella of the cases, we’ll access our data through the GenomicDataCommons function cases().
Let’s start out by simply asking how many different cases are stored by GDC:
GenomicDataCommons::cases() %>% GenomicDataCommons::count()
## [1] 34893
We see that there are over 34,000 cases as of July 2019.
Identifying Breast Cancer Project ID
As we’re focusing on breast cancer, and the samples are grouped into projects according to the type of cancer, we’ll need to determine the correct identifier for subsetting all cases by those involving breast cancer.
First, we need to determine the exact field and identifier for the Project ID. We’ll use the grep() function to try and find that field.
project_fields <- grep('proj', available_fields('cases'), value=TRUE)
print(project_fields)
## [1] "project.dbgap_accession_number"
## [2] "project.disease_type"
## [3] "project.intended_release_date"
## [4] "project.name"
## [5] "project.primary_site"
## [6] "project.program.dbgap_accession_number"
## [7] "project.program.name"
## [8] "project.program.program_id"
## [9] "project.project_id"
## [10] "project.releasable"
## [11] "project.released"
## [12] "project.state"
## [13] "tissue_source_site.project"
It appears that the exact field name is project.project_id. Let’s determine what value a project ID would have for a breast cancer sample.
projClasses <- cases() %>%
facet('project.project_id') %>%
aggregations()
head(projClasses$project.project_id)
## key doc_count
## 1 FM-AD 18004
## 2 TARGET-NBL 1120
## 3 TCGA-BRCA 1098
## 4 MMRF-COMMPASS 995
## 5 TARGET-AML 988
## 6 TARGET-WT 652
We can see that one of the most popular entries under ‘project_id’ is ‘TCGA-BRCA’, giving 1098 cases of breast cancer as of July 2019. We will focus on this dataset going forward.
Identifying TCGA identifier and Downloading Sample Data
We need to ensure that for the breast cancer sample attributes we download, that there is a TCGA ID related to the case number. We’ll need that for identifying the right RNA-Seq identifiers in other publications later on.
The ‘case_id’ is needed for joining data across tables or object, but its format is not in the TCGA format. e.g. 87b85935-a058-44ad-8fb6-8511130eaffe
We start by figuring out what the ‘sample_id’ might be called in GDC. We’ll use grep() for this:
sample_fields <- grep('sample_id', available_fields('cases'), value=TRUE)
print(sample_fields)
## [1] "sample_ids" "samples.sample_id" "submitter_sample_ids"
More than one identifier will serve our purpose, as long as it begins with TCGA, and contains the TSS, Participant, and Sample fields. Based on iterating through this code previously, it was found that submitter_sample_ids will suffice for our purpose, as shown with the following code. We are going to pull all the sample attribute data that we’ll need, and we’ll look at it step-by-step below, starting with the submitter_sample_ids.
expands = c("diagnoses","annotations",
"demographic","exposures")
brcaResults <- cases() %>%
GenomicDataCommons::filter(~ project.project_id == 'TCGA-BRCA') %>%
GenomicDataCommons::select("submitter_sample_ids") %>%
GenomicDataCommons::expand(expands) %>%
response_all()
print(c("`f06f09f3-8133-4a92-ac86-fbe64295e0d8` ->", (list(brcaResults$results$submitter_sample_ids$`f06f09f3-8133-4a92-ac86-fbe64295e0d8`))))
## [[1]]
## [1] "`f06f09f3-8133-4a92-ac86-fbe64295e0d8` ->"
##
## [[2]]
## [1] "TCGA-E9-A1ND-10A" "TCGA-E9-A1ND-01Z" "TCGA-E9-A1ND-01A"
## [4] "TCGA-E9-A1ND-11A"
We see that the TSS, Participant, and Sample fields are present in these submitter_sample_ids. • The case_id begins with ‘f06f…’ • We see that it maps to 4 distinct submitter_sample_ids. Each one begins with ‘TCGA-E9-A1ND’. • “E9” is the code for the “Breast Invasive Carcinoma” study at the Asterand Biosciences. • “A1ND” is the patient/participant code for that study. We notice that these values are the same for all 4 submitter_sample_id codes, which says that all 4 samples were generated from the same patient. • 2 of the sample_code fields are “01Z” and “01A”. “01”” refers to a “Primary Solid Tumor”, and A and Z refer to the fact that these are 2 separate samples. I assume that are labeled A and Z (rather than A and B) because they are the 1st and 2nd Primary Solid Tumors out of a total of 2 (rather than the 1st and 2nd out of several). • We also notice that this patient has 2 other samples whose sample type is 10 = “Blood-Derived Normal Cell” and 11 = “Solid Tissue Normal”. That means that 2 non-cancerous controls were taken from this patient, in addition to the 2 primary solid tumor samples taken.
Exploring brcaResults Data Object
brcaResults is the name of the GDCcasesResponse object, which is fundamentally a list, that was returned from the data query to GDC. We’d like to understand how the demographic, diagnoses, exposure, and annotations objects are organized.
dem <- class(brcaResults$results$demographic)
diag <- class(brcaResults$results$diagnoses)
annot <- class(brcaResults$results$annotations)
exp <- class(brcaResults$results$exposures)
print(c("demographic ->", dem, "diagnostic ->", diag, "exposures -> ", exp, "annotations -> ", annot))
## [1] "demographic ->" "data.frame" "diagnostic ->" "list"
## [5] "exposures -> " "list" "annotations -> " "list"
We can see that demographic is a data frame. (It contains the demographic data directly.) Right now, we’ll explore how the annotations and diagnoses lists are laid out, and how sparse they are.
lists <- c("brcaResults$results$annotations", "brcaResults$results$diagnoses")
for(listName in lists){
numAnnotNull <- 0
resNum <- 0
for(result in eval(parse(text=listName))){
resNum = resNum+ 1
if(is.null(result)){
numAnnotNull = numAnnotNull + 1
}
}
print(paste("The fraction of ", listName, " records that are empty is ", numAnnotNull/resNum))
}
## [1] "The fraction of brcaResults$results$annotations records that are empty is 0.973588342440802"
## [1] "The fraction of brcaResults$results$diagnoses records that are empty is 0.000910746812386157"
Clearly, the annotations list is almost completely empty, so we’ll want to ignore it for this work, while the diagnoses list is almost entirely populated and we’ll want to use it.
The structure of the exposures list of dataframes is a bit different, as empty fields are still populated with a header. Let’s see how well it’s populated:
fields <- c("cigarettes_per_day", "alcohol_intensity", "bmi", "years_smoked")
numAnnotNull <- 0
resNum <- 0
for(result in brcaResults$results$exposures){
if(!is.null(result)){
resNum = resNum + 1
if(all(is.na(result[fields])) & result$alcohol_history=="Not Reported"){
numAnnotNull = numAnnotNull + 1
}
}
}
print(paste("The fraction of the exposures records that are empty is ", numAnnotNull/resNum))
## [1] "The fraction of the exposures records that are empty is 1"
So clearly the exposures list has no useful information.
Selecting Relevant Fields in demographic and diagnoses data objects
Now that we’ve decided to not use the annotations and exposures data lists, we’ll see what fields in the demographic and diagnoses objects we might be interested in.
Let’s see what fields are available in the demographic data frame:
colnames(brcaResults$results$demographic)
## [1] "updated_datetime" "created_datetime" "gender"
## [4] "state" "submitter_id" "year_of_birth"
## [7] "race" "days_to_birth" "ethnicity"
## [10] "vital_status" "demographic_id" "age_at_index"
## [13] "year_of_death" "days_to_death"
As we’re interested in characteristics about the patients and not the time when the data were entered into the GDC database, we’ll subset these fields and look at the first few rows in tibble format.
importantDemographicFields <- c("gender", "race", "days_to_birth", "ethnicity",
"vital_status", "age_at_index", "days_to_death",
"year_of_birth", "year_of_death")
brcaData1098 <- brcaResults$results$demographic[importantDemographicFields]
brcaData1098 <- cbind(rownames(brcaData1098), brcaData1098)
colnames(brcaData1098) <- c("sampleId", importantDemographicFields)
brcaData1098 <- as_tibble(brcaData1098)
head(brcaData1098, n=4)
## # A tibble: 4 x 10
## sampleId gender race
## <fct> <chr> <chr>
## 1 87b85935-a058-44ad-8fb6-8511130eaffe female white
## 2 dd96a9c7-899c-47cd-a0f9-b149ed07a5d6 female white
## 3 3f834fa7-6d7b-4b85-98c0-5c55d55b6c95 female black or african american
## 4 451e1a67-47e6-4738-99d7-fb7771ef61a3 female white
## days_to_birth ethnicity vital_status age_at_index
## <int> <chr> <chr> <int>
## 1 -30830 not hispanic or latino Alive 84
## 2 -19822 not hispanic or latino Alive 54
## 3 -13982 not hispanic or latino Dead 38
## 4 -26941 not hispanic or latino Dead 73
## days_to_death year_of_birth year_of_death
## <int> <int> <int>
## 1 NA 1926 NA
## 2 NA 1956 NA
## 3 1993 1957 2000
## 4 3126 1921 2002
We can see that these selected fields are all important, and we can also deduce that “age_at_index” refers to the age at which the patient was diagnosed with breast cancer. The 4th patient in the output above was born in 1921, was 73 years old at “age_at_index”, had 3126 days_to_death (3126/365=8.5 years), and died in 2002. 1921 + 73 + 8.5 = 2002.5.
Now let’s look at what fields in the diagnoses list of dataframes might be of interest.
colnames(brcaResults$results$diagnoses$`87b85935-a058-44ad-8fb6-8511130eaffe`)
## [1] "year_of_diagnosis"
## [2] "classification_of_tumor"
## [3] "last_known_disease_status"
## [4] "updated_datetime"
## [5] "primary_diagnosis"
## [6] "submitter_id"
## [7] "tumor_stage"
## [8] "age_at_diagnosis"
## [9] "morphology"
## [10] "days_to_last_known_disease_status"
## [11] "created_datetime"
## [12] "prior_treatment"
## [13] "state"
## [14] "days_to_recurrence"
## [15] "diagnosis_id"
## [16] "tumor_grade"
## [17] "icd_10_code"
## [18] "days_to_diagnosis"
## [19] "tissue_or_organ_of_origin"
## [20] "progression_or_recurrence"
## [21] "prior_malignancy"
## [22] "synchronous_malignancy"
## [23] "site_of_resection_or_biopsy"
## [24] "days_to_last_follow_up"
Of these diagnoses fields, we’ll use the following: “primary_diagnosis”, “tumor_stage”, “age_at_diagnosis”, “morphology”, “prior_treatment”, “tissue_or_organ_of_origin”, “prior_malignancy”.
‘tumor_grade’ is not reported, ‘icd_10_code’ is always C50.9 (invasive breast cancer), and “days_to_recurrence” is always NA, so they won’t be used.
importantDiagnosisFields <- c("primary_diagnosis", "tumor_stage", "age_at_diagnosis",
"morphology", "prior_treatment", "tissue_or_organ_of_origin",
"prior_malignancy")
brcaSampleIDs1098 <- names(brcaResults$results$diagnoses)
diagnostics <- brcaResults$results$diagnoses[[brcaSampleIDs1098[1]]][importantDiagnosisFields]
sampleId <- c(brcaSampleIDs1098[[1]])
for(id in brcaSampleIDs1098[2:length(brcaSampleIDs1098)]){
record <- brcaResults$results$diagnoses[[id]][importantDiagnosisFields]
if(!is.null(record)){ # handles NULL record
diagnostics <- rbind(diagnostics, record)
sampleId <- c(sampleId, id)
}
}
diagnostics <- cbind(sampleId, diagnostics)
diagnostics <- as_tibble(diagnostics)
Joining together the 2 Tibbles
We’ll now want to join together all the columns from the brcaData1098 tibble, which contains the demographic attributes of interest, and the diagnostics tibble, which contains the cancer tumor (or healthy sample) attributes of interest.
suppressWarnings(brcaSampleData1097 <- inner_join(brcaData1098, diagnostics))
## Joining, by = "sampleId"
As there was a record that existed in demographics that did not exist in diagnoses, that row was removed, bringing the number of patients in our full breast cancer dataset to 1097. The Tibble storing the data is called brcaSampleData1097.
Adding TCGA Identifiers to Dataframe
One other piece of information we’ll need is to add on the list of TCGA IDs associated with each sampleId.
# set up first row of expanded data frame/tibble:
smpId <- as.character(brcaSampleData1097[1,"sampleId"])
tcgaId <- brcaResults$results$submitter_sample_ids[smpId]
tmp <- cbind(brcaSampleData1097[1,], paste(unlist(tcgaId), collapse=";")) #tmp dataframe
# loop through other rows and add them
for(i in 2:dim(brcaSampleData1097)[1]){
smpId <- as.character(brcaSampleData1097[i,"sampleId"])
tcgaId <- brcaResults$results$submitter_sample_ids[smpId]
tmp <- rbind(tmp, cbind(brcaSampleData1097[i,], paste(unlist(tcgaId), collapse=";")))
}
# add new column name, convert to tibble and print first 4 rows:
names(tmp)[names(tmp) == 'paste(unlist(tcgaId), collapse = ";")'] <- 'tcgaID'
brcaSampleData1097 <- as_tibble(tmp)
head(brcaSampleData1097, n=4)
## # A tibble: 4 x 18
## sampleId gender race
## <chr> <chr> <chr>
## 1 87b85935-a058-44ad-8fb6-8511130eaffe female white
## 2 dd96a9c7-899c-47cd-a0f9-b149ed07a5d6 female white
## 3 3f834fa7-6d7b-4b85-98c0-5c55d55b6c95 female black or african american
## 4 451e1a67-47e6-4738-99d7-fb7771ef61a3 female white
## days_to_birth ethnicity vital_status age_at_index
## <int> <chr> <chr> <int>
## 1 -30830 not hispanic or latino Alive 84
## 2 -19822 not hispanic or latino Alive 54
## 3 -13982 not hispanic or latino Dead 38
## 4 -26941 not hispanic or latino Dead 73
## days_to_death year_of_birth year_of_death
## <int> <int> <int>
## 1 NA 1926 NA
## 2 NA 1956 NA
## 3 1993 1957 2000
## 4 3126 1921 2002
## primary_diagnosis tumor_stage age_at_diagnosis
## <chr> <chr> <int>
## 1 Mucinous adenocarcinoma stage iia 30830
## 2 Infiltrating duct carcinoma, NOS stage iib 19822
## 3 Lobular carcinoma, NOS stage iiia 13982
## 4 Infiltrating duct and lobular carcinoma not reported 26941
## morphology prior_treatment tissue_or_organ_of_origin prior_malignancy
## <chr> <chr> <chr> <chr>
## 1 8480/3 No Breast, NOS no
## 2 8500/3 No Breast, NOS no
## 3 8520/3 No Breast, NOS no
## 4 8522/3 No Breast, NOS no
## tcgaID
## <fct>
## 1 TCGA-D8-A1XV-10A;TCGA-D8-A1XV-01A;TCGA-D8-A1XV-01Z
## 2 TCGA-D8-A1JB-10A;TCGA-D8-A1JB-01Z;TCGA-D8-A1JB-01A
## 3 TCGA-B6-A0IE-01A;TCGA-B6-A0IE-01Z;TCGA-B6-A0IE-10A
## 4 TCGA-B6-A0RP-01A;TCGA-B6-A0RP-01Z;TCGA-B6-A0RP-10A
Exporting Full Set of Breast Cancer Sample Attributes
As part of our work, we’ll want to work on the full set of breast cancer samples. So we’ll export the brcaSampleData1097 data frame at this point for later use.
write.table(brcaSampleData1097, "TCGA_Attributes_full.txt", sep="\t", row.names = FALSE)
Exploring Sub-Classes of Breast cancer
For some work we’ll do on data simulation, we’ll subset the full attribute data frame to those having specific attributes.
We select out only those cases that are breast cancer and see what data are available to us:
cases() %>%
GenomicDataCommons::filter( ~ project.project_id == 'TCGA-BRCA') %>% # '~' needed
facet('disease_type') %>%
aggregations() %>%
head()
## $disease_type
## key doc_count
## 1 Ductal and Lobular Neoplasms 1054
## 2 Cystic, Mucinous and Serous Neoplasms 16
## 3 Complex Epithelial Neoplasms 14
## 4 Epithelial Neoplasms, NOS 5
## 5 Adenomas and Adenocarcinomas 3
## 6 Fibroepithelial Neoplasms 2
## 7 Squamous Cell Neoplasms 2
## 8 Adnexal and Skin Appendage Neoplasms 1
## 9 Basal Cell Neoplasms 1
We see that ‘Ductal and Lobular Neoplasms’ cover almost all cases. Analysis of the most common values in the field primary_diagnosis shows that 778 samples have the specific diagnosis “Infiltrating duct carcinoma, NOS” (see below).
tbl <- table(brcaSampleData1097$primary_diagnosis)
kable(head(sort(tbl, decreasing=TRUE), n=5))
Var1 | Freq |
---|---|
Infiltrating duct carcinoma, NOS | 778 |
Lobular carcinoma, NOS | 201 |
Infiltrating duct and lobular carcinoma | 28 |
Infiltrating duct mixed with other types of carcinoma | 19 |
Mucinous adenocarcinoma | 16 |
Subsetting Samples
In an analysis being performed modeling healthy and tumor samples, limitations are being placed on which samples will be used to reduced confounding.
From the full brcaSampleData1097 breast cancer dataset, we’ll filter out: (a) those samples that are not ‘Infiltrating duct carcinoma, NOS’ (b) men (12 samples in total) (c) those from women not between ages 40-49 at the time of diagnosis (d) those that have previous treatments or previous diagnoses (e) those from women who are not white and those who are Latino
brcaDuctF40W_86 <- brcaSampleData1097 %>% # start with 1097 samples
dplyr::filter(primary_diagnosis=='Infiltrating duct carcinoma, NOS') %>% # 778 remaining
dplyr::filter(gender=="female") %>% # 767 remaining
dplyr::filter(age_at_index >= 40 & age_at_index < 50) %>% # 159 remaining
dplyr::filter(race=="white") %>% # 100 remaining
dplyr::filter(ethnicity=="not hispanic or latino") # 86 remaining
Writing Desired Attributes to a File
We will write the brcaDuctF40W_86 dataframe to a corresponding file.
write.table(brcaDuctF40W_86, "brcaDuctF40W_86.txt", sep="\t", row.names=FALSE)
This file can now be used to study samples from various studies.