How to Access Files for Specific Case IDs¶


This notebook demonstrates how to build a cohort of MIDRC patients based on clinical and demographic data and then obtain a file download manifest for x-ray and annotation files related to that cohort.

by Chris Meyer, PhD

Manager of Data and User Services at the Center for Translational Data Science at University of Chicago

August 2023

1) Set up Python environment¶


Set local variables¶


Change the following variables to valid values for the environment where you're running this notebook. In particular, "cred" must point to the location of your credentials file.

In [5]:
cred = "/Users/christopher/Downloads/midrc-credentials.json" # location of your MIDRC credentials, downloaded from https://data.midrc.org/identity by clicking "Create API key" button and saving the credentials.json locally
api = "https://data.midrc.org" # The base URL of the data commons being queried. This shouldn't change for MIDRC.

Install / Import Python Packages and Scripts¶

In [6]:
## The packages below may be necessary for users to install according to the imports necessary in the subsequent cells.

#!pip install --upgrade pandas
#!pip install --upgrade --ignore-installed PyYAML
#!pip install --upgrade pip
#!pip install --upgrade gen3
In [7]:
## Import Python Packages and scripts

import os
import subprocess  # needed by the download loop near the end of this notebook
import gen3

from gen3.auth import Gen3Auth
from gen3.query import Gen3Query

Initiate instances of the Gen3 SDK Classes using credentials file for authentication¶


Again, make sure the "cred" path variable set above reflects the location of your credentials file.

In [8]:
auth = Gen3Auth(api, refresh_file=cred) # authentication class
query = Gen3Query(auth) # query class

2) Build a cohort of cases by running queries against MIDRC APIs¶


  • There are many ways to query and access metadata for cohort building in MIDRC, but this notebook focuses on the Gen3 GraphQL query service "guppy". This is the backend query service that MIDRC's data explorer GUI uses, so anything you can do in the explorer GUI, you can also do with guppy queries, and more!
  • The guppy GraphQL service has more functionality than this simple example demonstrates; extensive documentation is available in GitHub here in case you'd like to build your own queries from scratch.
  • The Gen3 SDK (initialized as "query" above in this notebook) has Python wrapper scripts that make sending queries to the guppy GraphQL API simpler. The guppy SDK package can be viewed in GitHub here.

Set 'case' query parameters¶


  • Below, we first set some query parameters. Feel free to modify these parameters to see how it changes the query response. Setting these patient attributes is akin to selecting a filter value in MIDRC's data explorer GUI.
  • For more documentation about how to use and combine filters with various operator logic (AND, OR, IN, etc.), see this page.
  • We then send our query to MIDRC's guppy API endpoint using the Gen3Query SDK package we initialized earlier.
  • If our query request is successful, the API response should be in JSON format, and it should contain a list of patient IDs along with any other patient data we ask for.
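
The filter structure described above can be sketched as a plain Python dictionary before it is sent anywhere. The helper below is a hypothetical convenience (build_case_filter and its nested_in argument are not part of the Gen3 SDK); it only assembles the same "AND" / "nested" / "IN" layout that the query cell in this notebook passes as filter_object, and it makes no network call.

```python
def build_case_filter(sex, min_age, max_age, nested_in=None):
    """Assemble a guppy-style filter dictionary.

    nested_in is a hypothetical argument for this sketch: it maps
    (path, field) tuples to the list of accepted values.
    """
    clauses = [
        {"=": {"sex": sex}},
        {">=": {"age_at_index": min_age}},
        {"<=": {"age_at_index": max_age}},
    ]
    # Each nested clause filters on a property of a child node of "case".
    for (path, field), values in (nested_in or {}).items():
        clauses.append({"nested": {"path": path, "IN": {field: list(values)}}})
    return {"AND": clauses}


example_filter = build_case_filter(
    "Male",
    50,
    89,
    nested_in={("measurements", "test_method"): ["RT-PCR"]},
)
print(example_filter)
```

Building the dictionary separately makes it easy to print and inspect a filter, or to add and remove clauses one at a time when a query unexpectedly returns no data.
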
In [9]:
#### "case" query parameters
## In this example, we filter our patient cohort to male patients between the ages of 50 and 89, then add nested filters on vaccine manufacturer, COVID-19 test method and result, and condition name.

## case demographic filters
sex = "Male"
min_age = 50
max_age = 89

#### "nested" filters, these are attributes from other nodes that are nested under the case node ("child nodes" of case in the data model: data.midrc.org/dd)
## medications (vaccine data)
medication_manufacturer = ["Pfizer","Moderna"] #,"Janssen","AstraZeneca","Sinopharm","Novavax"]

## measurements filters (COVID-19 test data)
test_method = ["RT-PCR"] #,"Rapid antigen test"]
test_result_text = ["Positive","Negative"]

## conditions filters (co-morbidities and long COVID)
condition_name = ["COVID-19","Post COVID-19 condition, unspecified"] #,"Pneumonia, organism unspecified"]

## procedures filters
procedure_name = ["Breathing Support"]
In [10]:
## Here is an example getting all the cases that match the demographic and nested filters set in the previous cell
## the "fields" option defines what fields we want the query to return. If set to "None", returns all available fields.

cases = query.raw_data_download(
                    data_type="case",
#                    fields=["project_id","submitter_id"],
                    fields=None,
                    filter_object={
                        "AND": [
                            {"=": {"sex": sex}},
                            {">=": {"age_at_index": min_age}},
                            {"<=": {"age_at_index": max_age}},
                            {"nested": {"path": "medications", "IN": {"medication_manufacturer": medication_manufacturer}}},
                            {"nested": {"path": "measurements", "IN": {"test_method": test_method}}},
                            {"nested": {"path": "measurements", "IN": {"test_result_text": test_result_text}}},
                            {"nested": {"path": "conditions", "IN": {"condition_name": condition_name}}},
                            #{"nested": {"path": "procedures", "IN": {"procedure_name": procedure_name}}}, # adding too many filters returns no data
                        ]
                    },
                    sort_fields=[{"submitter_id": "asc"}]
                )

if len(cases) > 0 and "submitter_id" in cases[0]:
    case_ids = [i['submitter_id'] for i in cases] ## make a list of the case (patient) IDs returned
    print("Query returned {} case IDs.".format(len(cases)))
    print("Data is a list with rows like this:\n\t {}".format(cases[0:1]))
else:
    print("Your query returned no data! Please, check that query parameters are valid.")
In [11]:
## Look at one record returned by the query
# Note: the "object_id" field is a list of all file identifiers associated with the case
cases[0]

3) Send another query to get data file details for our cohort / case ID¶


The object_id field in each case record above contains the file identifiers for all files associated with each case. If we simply want to access all files associated with our list of cases, we can use those object_ids. However, in this example, we'll ask for specific types of files and get more detailed information about each of the files. This is achieved by querying the "data_file" index and adding our cohort (list of case_ids) as a filter.

  • Note: all MIDRC data files, including both images and annotations, are listed in the guppy index "data_file", which is queried in a similar manner to our query of the "case" index above. The query parameter "data_type" below determines which Elasticsearch index we're querying.
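
For the simpler route mentioned above (taking every file attached to the cohort straight from the case records), a minimal sketch follows. It assumes each case record may carry an "object_id" list, as noted earlier in this notebook; the sample records and GUID strings are made up for illustration, and collect_object_ids is a hypothetical helper name.

```python
def collect_object_ids(case_records):
    """Flatten the per-case "object_id" lists into one de-duplicated list."""
    seen = set()
    ordered = []
    for record in case_records:
        ids = record.get("object_id") or []
        if isinstance(ids, str):  # tolerate a single GUID stored as a string
            ids = [ids]
        for guid in ids:
            if guid not in seen:
                seen.add(guid)
                ordered.append(guid)
    return ordered


# Made-up records that only mimic the shape of the query response.
sample_cases = [
    {"submitter_id": "case-A", "object_id": ["guid-1", "guid-2"]},
    {"submitter_id": "case-B", "object_id": ["guid-2", "guid-3"]},
    {"submitter_id": "case-C"},  # a case with no files attached
]
all_case_object_ids = collect_object_ids(sample_cases)
print(all_case_object_ids)
```
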

Set 'data_file' query parameters¶


Here, we'll utilize the property "source_node" to filter the list of files for our cohort to only those matching the type of files we're interested in. In this example, we ask for CR and DX images and any associated annotation files.

  • Note: We're using the property "case_ids" as a filter to restrict the data_file records returned down to those associated with cases in our cohort built above. If you'd like to search for only one specific case_id, you can manually set the case_ids variable like this:
case_ids = ["my_case_id"]
  • Or alternatively, you could set the query filter like this:
{"=": {"case_ids": "my_case_id"}},

where "my_case_id" is the quoted submitter_id of the case you're searching for.
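
The two alternatives above (a list of case IDs with "IN", or a single case ID with "=") can be folded into one small helper. This is only a sketch: build_data_file_filter is a hypothetical name, not an SDK function, and the property names are the ones used in this notebook.

```python
def build_data_file_filter(case_ids, source_nodes=None, modality=None):
    """Build a filter for the "data_file" index.

    A single string produces an "=" clause; any other iterable
    produces an "IN" clause.
    """
    if isinstance(case_ids, str):
        clauses = [{"=": {"case_ids": case_ids}}]
    else:
        clauses = [{"IN": {"case_ids": list(case_ids)}}]
    # Optional clauses are only added when a value list is supplied.
    if source_nodes:
        clauses.append({"IN": {"source_node": list(source_nodes)}})
    if modality:
        clauses.append({"IN": {"modality": list(modality)}})
    return {"AND": clauses}


single_case_filter = build_data_file_filter("my_case_id", modality=["CR", "DX"])
print(single_case_filter)
```
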

In [12]:
source_nodes = ["cr_series_file","dx_series_file","annotation_file","dicom_annotation_file"]
modality = ["SEG", "CR", "DX"] # this is somewhat redundant with the above source_node filter, but added here for demonstration purposes
In [13]:
## Search for specific files associated with our cohort by adding "case_ids" as a filter
# * Note: "fields" is set to "None" in this query, which by default returns all the properties available
data_files = query.raw_data_download(
                    data_type="data_file",
                    fields=None,
                    filter_object={
                        "AND": [
                            {"IN": {"case_ids": case_ids}},
                            {"IN": {"source_node": source_nodes}},
                            {"IN": {"modality": modality}},
                        ]
                    },
                    sort_fields=[{"submitter_id": "asc"}]
                )

if len(data_files) > 0:
    object_ids = [i['object_id'] for i in data_files if 'object_id' in i] ## make a list of the file object_ids returned by our query
    print("Query returned {} data files with {} object_ids.".format(len(data_files),len(object_ids)))
    print("Data is a list with rows like this:\n\t {}".format(data_files[0:1]))
else:
    print("Your query returned no data! Please, check that query parameters are valid.")
In [14]:
## View the detailed data for the first file returned
data_files[0]
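
Before downloading anything, it can help to check how the returned files are distributed across file types. The sketch below counts records by one property using only the standard library; the sample records are made up, and count_by_field is a hypothetical helper rather than part of the Gen3 SDK.

```python
from collections import Counter


def count_by_field(records, field):
    """Count records by the value of one property, tolerating missing keys."""
    counts = Counter()
    for record in records:
        value = record.get(field, "<missing>")
        if isinstance(value, list):  # lists are not hashable, so use a tuple
            value = tuple(value)
        counts[value] += 1
    return counts


# Made-up records that only mimic the shape of the data_file response.
sample_files = [
    {"source_node": "cr_series_file"},
    {"source_node": "dx_series_file"},
    {"source_node": "cr_series_file"},
    {},
]
source_node_counts = count_by_field(sample_files, "source_node")
print(source_node_counts.most_common())
```

With the real query response, the same call would be count_by_field(data_files, "source_node").
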

4) Access data files using their object_id / data GUID (globally unique identifiers)¶


In order to download files stored in MIDRC, users need to reference the file's object_id (AKA data GUID or Globally Unique IDentifier).

Once we have a list of GUIDs we want to download, we can use either the gen3-client or the gen3 SDK to download the files. You can also access individual files in your browser after logging in and entering the GUID after the files/ endpoint, as in this URL: https://data.midrc.org/files/GUID

where GUID is the actual GUID, e.g.: https://data.midrc.org/files/dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec
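
The URL pattern above is simple enough to generate for a whole list of GUIDs. A minimal sketch, where file_page_url is a hypothetical helper name and the GUID is the example given above:

```python
def file_page_url(guid, base_url="https://data.midrc.org"):
    """Return the browser URL for one file, following the files/GUID pattern."""
    return "{}/files/{}".format(base_url.rstrip("/"), guid)


example_url = file_page_url("dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec")
print(example_url)
```
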

For instructions on how to install and use the gen3-client, please see the MIDRC quick-start guide, which can be found linked here and in the MIDRC data portal header as "Get Started".

Below we use the gen3 SDK command gen3 drs-pull object which is documented in detail here.
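
When the command is assembled in Python, one option is to build it as a list of arguments rather than a single formatted string, so that a credentials path containing spaces does not break the command. The sketch below only constructs and prints the command; build_drs_pull_args and the credentials path are hypothetical, the flags are the ones used in this notebook, and shlex.join requires Python 3.8 or later.

```python
import shlex


def build_drs_pull_args(cred_path, object_id, output_dir="downloads",
                        endpoint="data.midrc.org"):
    """Return the "gen3 drs-pull object" command as an argument list."""
    return [
        "gen3",
        "--auth", cred_path,
        "--endpoint", endpoint,
        "drs-pull", "object", object_id,
        "--output-dir", output_dir,
    ]


drs_pull_args = build_drs_pull_args(
    "/path with space/credentials.json",  # hypothetical location
    "dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec",
)
# Show the shell-quoted form of the command for inspection.
print(shlex.join(drs_pull_args))
```

A list built this way can be passed to subprocess.run directly, without shell=True, which avoids shell quoting problems altogether.
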

Parse the data_file query response to build a list of all object_ids returned for our cohort.¶

In [15]:
## Build a list of the object_ids found in the data_file records
object_ids = []
for data_file in data_files:
    if 'object_id' in data_file:
        object_id = data_file['object_id']
        object_ids.append(object_id)

object_id = object_ids[0] # index 0 is the first object_id in the list
print("The first object_id of {}: '{}'".format(len(object_ids),object_id))

Use the Gen3 SDK command gen3 drs-pull object to download an individual file¶

In [16]:
## Make a new directory for downloaded files
os.makedirs("downloads", exist_ok=True) # portable equivalent of "mkdir -p downloads"
In [17]:
## Run the "gen3 drs-pull object" command to download a file
cmd = "gen3 --auth {} --endpoint data.midrc.org drs-pull object {} --output-dir downloads".format(cred,object_id)
os.system(cmd)
In [18]:
!find downloads -name "*.dcm"

Use a simple loop to download all the files¶

In [19]:
## Simple loop to download all files and keep track of success and failures
cred = "/Users/christopher/Downloads/midrc-credentials.json" # location of your MIDRC credentials, downloaded from https://data.midrc.org/identity by clicking "Create API key" button and saving the credentials.json locally
success,failure,other=[],[],[]
count,total = 0,len(object_ids)
for object_id in object_ids:
    count+=1
    cmd = "gen3 --auth {} --endpoint data.midrc.org drs-pull object {} --output-dir downloads".format(cred,object_id)
    stout = subprocess.run(cmd, shell=True, capture_output=True)
    print("Progress ({}/{}): {}".format(count,total,stout.stdout))
    if "failed" in str(stout.stdout):
        failure.append(object_id)
    elif "successfully" in str(stout.stdout):
        success.append(object_id)
    else:
        other.append(object_id)
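
The success / failure bookkeeping in the loop above can be pulled out into a small function, which makes it easy to test without downloading anything. This is a sketch under two assumptions: the keywords "failed" and "successfully" are taken from the loop above, and the sample messages below are made up rather than real gen3 output. Unlike the loop, this version decodes bytes and matches case-insensitively.

```python
def classify_pull_output(output):
    """Label captured command output as "failure", "success", or "other"."""
    if isinstance(output, bytes):
        text = output.decode("utf-8", errors="replace")
    else:
        text = str(output)
    lowered = text.lower()
    if "failed" in lowered:
        return "failure"
    if "successfully" in lowered:
        return "success"
    return "other"


# Made-up messages used only to exercise the three branches.
for message in (b"download failed", "file successfully saved", ""):
    print(repr(message), "->", classify_pull_output(message))
```
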
In [20]:
!find downloads -name "*.dcm"
In [21]:
!find downloads -name "*.dcm" | wc -l

The End¶


If you have any questions related to this notebook, don't hesitate to reach out to the MIDRC Helpdesk at midrc-support@gen3.org or the author directly at cgmeyer@uchicago.edu.

Happy data wrangling!
