Tuesday, January 11, 2011

A Primer on XML, RDF, JSON, and Metadata

A new workgroup, formed under the auspices of the HIT Policy Committee and the HIT Standards Committee, is beginning its work to help ONC analyze public comments on the President’s Council of Advisors on Science and Technology (PCAST) report, discuss the report's implications for current ONC strategies, assess the feasibility and impact of its recommendations on ONC programs, and elaborate on how these recommendations could be integrated into the ONC strategic framework.

Membership includes:
Paul Egerman, Entrepreneur, Chair
William Stead, Vanderbilt University, Vice-Chair
Dixie Baker, SAIC
Hunt Blair, Vermont HIE
Tim Elwell, Misys Open Source
Carl A. Gunter, University of Illinois
John Halamka, Beth Israel Deaconess Medical Center, HMS
Leslie Harris, Center for Democracy & Technology
Stan Huff, Intermountain Healthcare
Robert Kahn, Corporation for National Research Initiatives
Gary Marchionini, University of North Carolina
Stephen Ondra, Office of Science & Technology Policy
Jonathan Perlin, Hospital Corporation of America
Richard Platt, Harvard Medical School
Wes Rishel, Gartner
Mark Rothstein, University of Louisville
Steve Stack, American Medical Association
Eileen Twiggs, Planned Parenthood

To advise ONC about the report's recommendations, workgroup members need to understand terms such as XML, RDF, JSON, and metadata, and to learn about the standards efforts to date to create human-readable and computable data elements for healthcare.

XML is an abbreviation for Extensible Markup Language, a set of rules for encoding documents in machine-readable form. Here's an example of data about me in XML, which is both human readable and computable:
<name><fullname>John David Halamka, M.D.</fullname><firstname>John</firstname><lastname>Halamka</lastname></name>
<address>
<address1>Beth Israel Deaconess Med Ctr</address1><address2>Information Systems, 6th Fl</address2><address3>1135 Tremont St  </address3><address4>Roxbury Crossing, MA 02120</address4><telephone>617/754-8002</telephone><fax>617/754-8015</fax><latitude>42.33555200000000</latitude><longitude>-71.08822700000000</longitude></address>

It's a machine-friendly form of my Harvard Catalyst Profiles web page, with discrete data elements that any computer language can interpret and search. The complete XML document about me is available here.
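To make "interpret and search" concrete, here is a minimal Python sketch that reads a few of those elements with the standard library. The wrapping `<person>` root element is my own assumption, added only so the two fragments parse as a single document; it is not part of the Profiles output.

```python
import xml.etree.ElementTree as ET

# Name and address fragments from the example above, wrapped in a
# hypothetical <person> root so the text parses as one XML document.
PERSON_XML = """
<person>
  <name>
    <fullname>John David Halamka, M.D.</fullname>
    <firstname>John</firstname>
    <lastname>Halamka</lastname>
  </name>
  <address>
    <address1>Beth Israel Deaconess Med Ctr</address1>
    <address4>Roxbury Crossing, MA 02120</address4>
    <telephone>617/754-8002</telephone>
    <latitude>42.33555200000000</latitude>
    <longitude>-71.08822700000000</longitude>
  </address>
</person>
"""

root = ET.fromstring(PERSON_XML)

# Each tag is a discrete data element that can be addressed by its path.
last_name = root.findtext("name/lastname")
telephone = root.findtext("address/telephone")
latitude = float(root.findtext("address/latitude"))

print(last_name, telephone, latitude)
```

The paths `name/lastname` and `address/telephone` only work because the code already knows how this particular document is laid out.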

XML has been used to describe healthcare data by HL7, in the Clinical Document Architecture (CDA), and by ASTM, in the Continuity of Care Record (CCR).

Here's an example of CDA that illustrates immunizations:
<informationsource><author><authortime value="20000407130000+0500"><authorname><prefix>Dr.</prefix><given>Robert</given><family>Dolin</family></authorname></authortime></author></informationsource>
<immunizations><immunization><administereddate value="199911"><medicationinformation><codedproductname code="88" codesystem="2.16.840.1.113883.6.59" displayname="Influenza virus vaccine"><freetextproductname>Influenza virus vaccine</freetextproductname></codedproductname></medicationinformation></administereddate></immunization></immunizations>

Metadata is "data about data" - the details behind this data such as who gathered it, when, and for what purpose.

The metadata in the CDA example includes an Object Identifier (OID) of 2.16.840.1.113883.6.59, which is the code for the Centers for Disease Control and Prevention's CVX immunization vocabulary. Code 88 is the CVX code for Influenza virus vaccine. The vaccine was administered in November of 1999. The information source is Bob Dolin. The full CDA summary is available here.
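As a rough sketch of how a program could recover the same facts, the Python below reads the immunization fragment exactly as it appears above. The `KNOWN_CODE_SYSTEMS` lookup table is a hypothetical helper that holds only the one OID mentioned in this post; a real system would need a full OID registry.

```python
import xml.etree.ElementTree as ET

# The immunization fragment from the CDA example above, unchanged.
CDA_FRAGMENT = """
<immunizations><immunization><administereddate value="199911"><medicationinformation><codedproductname code="88" codesystem="2.16.840.1.113883.6.59" displayname="Influenza virus vaccine"><freetextproductname>Influenza virus vaccine</freetextproductname></codedproductname></medicationinformation></administereddate></immunization></immunizations>
"""

# Hypothetical lookup table: only the OID discussed in the text.
KNOWN_CODE_SYSTEMS = {"2.16.840.1.113883.6.59": "CVX"}

root = ET.fromstring(CDA_FRAGMENT)
product = root.find(".//codedproductname")
administered = root.find(".//administereddate").get("value")  # "YYYYMM"

summary = {
    "vocabulary": KNOWN_CODE_SYSTEMS.get(product.get("codesystem"), "unknown"),
    "code": product.get("code"),
    "name": product.get("displayname"),
    "year": int(administered[:4]),
    "month": int(administered[4:6]),
}

print(summary)
```

The code system OID tells the program which vocabulary to consult before it interprets the code 88.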

Here's an example of CCR that illustrates immunizations:
<actor><actorobjectid>AA0001</actorobjectid><person><name><currentname><given>John</given><middle>David</middle><family>Halamka</family></currentname></name><dateofbirth><exactdatetime>1962-05-23T04:00:00Z</exactdatetime></dateofbirth><gender><text>M</text></gender></person></actor>
<address>
<type><text>Home</text></type><line1>11 Alden Road</line1><city>Wellesley</city><state>MA</state><postalcode>02481</postalcode></address>
<telephone><value>781-239-9771</value><type><text>Home</text></type></telephone><actor><actorid>AA0001</actorid></actor>

<immunization><ccrdataobjectid>BB0024</ccrdataobjectid><datetime><type><text>Date Updated</text></type><exactdatetime>2011-01-08T19:49:19Z</exactdatetime></datetime><datetime><type><text>Start date</text></type><exactdatetime>2010-10-11T04:00:00Z</exactdatetime></datetime><type><text>Immunization</text></type><actor><actorid>AA0001</actorid></actor><product><productname><text>Tetanus</text><code><value>35</value><codingsystem>HL7 CVX</codingsystem><version>2.5</version></code><code><value>396412003</value><codingsystem>SNOMEDCT</codingsystem><version>2005</version></code><code><value>C0039619</value><codingsystem>UMLS Concept ID</codingsystem><version>2005</version></code></productname></product></immunization>
<form>
<text>Toxoid</text></form>
<directions><direction><route><text>IM</text></route><site><text>Right Arm</text></site></direction></directions>

The metadata in the CCR example states that the patient is John Halamka, born 5/23/1962, male, living in Wellesley. Additional metadata identifies that a tetanus shot exists in the record. The concept "Tetanus shot" is described using the Centers for Disease Control and Prevention's CVX immunization vocabulary, the SNOMED-CT vocabulary, and the National Library of Medicine's UMLS Metathesaurus. Metadata about the reliability of the information includes who reported the tetanus shot and when it was reported. The metadata in my record describes me as the source of the reported information, updated January 8, 2011. The full CCR summary is available here.

XML is a very general construct. Anyone can create any tags for data and metadata. HL7 has chosen to create a Reference Information Model (RIM) to describe the meaning of its tags and metadata. ASTM has created a well-described, fixed set of data elements. The challenge that different XML tagging creates is that you have to figure out where to look for the information you want. For the XML example above about my name and address, everyone creating a person directory could create the XML differently. In one directory, a person's "lastName" could be a root element, in another it could be a child of an element called "name", and in another it could be an attribute of a "person" element. The XML below is just as valid a way to describe my address as the example above:
<address city="Boston" postalcode="02120" state="MA" streetaddress="1135 Tremont">
  <phonenumbers>
    <phonenumber number="617 754-8002" type="home"></phonenumber>
    <phonenumber number="617 754-8015" type="fax"></phonenumber>
  </phonenumbers>
</address>
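The practical cost of this freedom can be sketched in a few lines of Python: the same two facts need a different extraction routine for each layout. The function names are mine, and the element-style sample is trimmed to the two fields being read.

```python
import xml.etree.ElementTree as ET

# Element-style layout, trimmed from the first address example.
ELEMENT_STYLE = """
<address>
  <address4>Roxbury Crossing, MA 02120</address4>
  <telephone>617/754-8002</telephone>
</address>
"""

# Attribute-style layout, as in the second address example.
ATTRIBUTE_STYLE = """
<address city="Boston" postalcode="02120" state="MA" streetaddress="1135 Tremont">
  <phonenumbers>
    <phonenumber number="617 754-8002" type="home"></phonenumber>
    <phonenumber number="617 754-8015" type="fax"></phonenumber>
  </phonenumbers>
</address>
"""

def read_element_style(xml_text):
    """Postal code is buried in free text; phone is a child element."""
    root = ET.fromstring(xml_text)
    postal_code = root.findtext("address4").split()[-1]
    return postal_code, root.findtext("telephone")

def read_attribute_style(xml_text):
    """Postal code is an attribute; phone is an attribute two levels down."""
    root = ET.fromstring(xml_text)
    phone = root.find("phonenumbers/phonenumber[@type='home']")
    return root.get("postalcode"), phone.get("number")

print(read_element_style(ELEMENT_STYLE))
print(read_attribute_style(ATTRIBUTE_STYLE))
```

Neither routine works on the other document, and even the phone number comes back formatted differently, which is exactly the "where do I look" problem.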

The Resource Description Framework (RDF) is a metadata model that provides a standardized approach to describing web resources. The general idea is to provide a subject-predicate-object model in which the predicate includes a definition of what is being described. RDF was created to solve the problem of organizations implementing XML tags heterogeneously.

Here's an RDF description of me:
<rdf:description rdf:about="http://connects.catalyst.harvard.edu/profiles/profile/person/46034/viewas/rdf" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:core="http://vivoweb.org/ontology/core#" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:vitro="http://vitro.mannlib.cornell.edu/ns/vitro/public#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing"></rdf:type>
  <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"></rdf:type>
  <rdf:type rdf:resource="http://purl.org/ontology/bibo/core#Faculty"></rdf:type>
  <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"></rdf:type>
  <rdfs:label xml:lang="en-US">John David Halamka, M.D.</rdfs:label>
  <rdf:type rdf:resource="http://vivoweb.org/ontology/core#FacultyMember"></rdf:type>
  <foaf:lastname>Halamka</foaf:lastname>
  <foaf:firstname>John</foaf:firstname>
  <core:preferredtitle>Associate Professor of Medicine</core:preferredtitle>
  <core:workfax>617/754-8015</core:workfax>
</rdf:description>

The subject is my Harvard Catalyst Profiles Page.

The predicates include "the subject has a lastname, a firstname, and a preferred title."

The objects are Halamka, John, and Associate Professor of Medicine.

The definitions of lastname, firstname, and preferred title are found in two places: the Friend of a Friend (FOAF) definition site and the VIVO site. The complete RDF document about me is available here.

Thus, RDF provides a means of displaying metadata while also enabling easy access to the definitions of data elements used.
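A minimal Python sketch of that idea, using plain tuples rather than an RDF library: each statement is a (subject, predicate, object) triple, and each predicate is a full URI whose namespace points at the site that defines it. The helper `definition_site` is my own illustration, and the predicate names are spelled as they appear in the example above.

```python
# Namespaces and subject URI taken from the RDF example above.
FOAF = "http://xmlns.com/foaf/0.1/"
CORE = "http://vivoweb.org/ontology/core#"
SUBJECT = "http://connects.catalyst.harvard.edu/profiles/profile/person/46034/viewas/rdf"

# Each statement is a (subject, predicate, object) triple.
triples = [
    (SUBJECT, FOAF + "lastname", "Halamka"),
    (SUBJECT, FOAF + "firstname", "John"),
    (SUBJECT, CORE + "preferredtitle", "Associate Professor of Medicine"),
]

def definition_site(predicate):
    """Return the namespace part of a predicate URI (hypothetical helper).

    The namespace ends at the last '#' if there is one, otherwise at the
    last '/'. That is where the term's definition is published.
    """
    for separator in ("#", "/"):
        head, found, _ = predicate.rpartition(separator)
        if found:
            return head + separator
    return predicate

sites = {definition_site(predicate) for _, predicate, _ in triples}
print(sorted(sites))
```

Unlike a bare XML tag, the predicate itself says where to look up what "lastname" means.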

With RDF, data is always represented as subjects, predicates, and objects, so reading, parsing, and storing it is consistent across all applications. It also enables querying different systems via a common approach. For example, if I exist as a faculty member in Profiles and as a provider in a clinical system that uses RDF, it should be possible to query for topics where I have both faculty and clinical expertise, without having to transform one data source into the other's schema. Similarly, if the government makes all grants, publications, trials, etc. available in RDF, then these things should automatically be available to tools like Profiles, without having to write any additional code.
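Here is a rough Python sketch of the faculty-and-clinical query, again with plain tuples instead of a real triple store. Everything specific in it is a made-up placeholder: the two `example.org` predicates, the topic names, and the `match` helper are assumptions for illustration, not data from Profiles or any clinical system.

```python
PERSON = "http://connects.catalyst.harvard.edu/profiles/profile/person/46034/viewas/rdf"

# Hypothetical predicates; real systems would publish their own vocabularies.
RESEARCH_TOPIC = "http://example.org/terms#researchTopic"
CLINICAL_EXPERTISE = "http://example.org/terms#clinicalExpertise"

# Placeholder triples standing in for two independent RDF sources.
profiles_triples = {
    (PERSON, RESEARCH_TOPIC, "Topic A"),
    (PERSON, RESEARCH_TOPIC, "Topic B"),
}
clinical_triples = {
    (PERSON, CLINICAL_EXPERTISE, "Topic B"),
    (PERSON, CLINICAL_EXPERTISE, "Topic C"),
}

def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        triple for triple in graph
        if (subject is None or triple[0] == subject)
        and (predicate is None or triple[1] == predicate)
        and (obj is None or triple[2] == obj)
    ]

# Because both sources share one shape, merging is a plain set union.
merged = profiles_triples | clinical_triples

research = {o for _, _, o in match(merged, PERSON, RESEARCH_TOPIC)}
clinical = {o for _, _, o in match(merged, PERSON, CLINICAL_EXPERTISE)}
shared_topics = research & clinical

print(shared_topics)
```

No schema mapping step was needed before the merge. The wildcard matching here is a hand-rolled stand-in for the triple patterns that an RDF query language evaluates.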

There is a standard query language called SPARQL that can be used to search RDF resources.

Finally, there is an emerging alternative to XML called JavaScript Object Notation (JSON) that is more compact than XML and easier for computer languages to manipulate. Here's an example of my address information in JSON:
{
     "firstName": "John",
     "lastName": "Halamka",
     "age": 48,
     "address":
     {
         "streetAddress": "1135 Tremont",
         "city": "Boston",
         "state": "MA",
         "postalCode": "02120"
     },
     "phoneNumber":
     [
         {
           "type": "office",
           "number": "617-754-8002"
         },
         {
           "type": "fax",
           "number": "617-754-8015"
         }
     ]
 }

JSON has replaced XML as a data interchange format in many social networking applications. It shares XML's problem that authors can create arbitrary formats: there could be a person object containing firstname and lastname, or lastname could itself be an object. You have to know how the author organized the data before you can use it.
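A short Python sketch shows both points. The first document is a trimmed copy of the JSON example above; the second, `ALTERNATE_JSON`, is a hypothetical layout I invented to stand for a different author's choices, and `last_name` is an illustrative helper.

```python
import json

# Trimmed copy of the JSON example above.
PERSON_JSON = """
{
  "firstName": "John",
  "lastName": "Halamka",
  "phoneNumber": [
    {"type": "office", "number": "617-754-8002"},
    {"type": "fax", "number": "617-754-8015"}
  ]
}
"""

# Hypothetical alternative layout another author might have chosen.
ALTERNATE_JSON = '{"name": {"first": "John", "last": "Halamka"}}'

# One call turns the text into native dictionaries and lists.
person = json.loads(PERSON_JSON)
fax_numbers = [p["number"] for p in person["phoneNumber"] if p["type"] == "fax"]

def last_name(doc):
    """Find the last name in either layout; each one must be known in advance."""
    if "lastName" in doc:
        return doc["lastName"]
    if isinstance(doc.get("name"), dict) and "last" in doc["name"]:
        return doc["name"]["last"]
    raise KeyError("unrecognized person layout")

print(fax_numbers)
print(last_name(person), last_name(json.loads(ALTERNATE_JSON)))
```

Parsing is a single call, but `last_name` still needs one branch per layout, and a third layout would raise an error until someone adds another branch.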

In summary, CDA and CCR already provide XML data for healthcare that is "data atomic", metadata rich, and searchable using standard tools. RDF is a standardized way of describing metadata. JSON is an efficient way of representing, transmitting, and interpreting data that is similar to XML but more compact.

Our report is due in April.  I welcome the discussion with the PCAST workgroup over the next 3 months!
