Friday, August 27, 2010

Counting and Summing - US States and their populations

Lets switch to the Census Bureau's SPARQL test page,http://www.rdfabout.com/demo/census/sparql.xpd, and find out some information about the 50 states of the United States.

It always helps to have the sparql engine give some guidance, or some starting queries. Still the bulk of your SPARQL queries will come from self-discovery of the subjects, predicates and the objects in the database.

Note: After playing a little with the Census Bureau's SPARQL test page, I switched back to the dbpedia Virtuoso browser, http://dbpedia.org/sparql because its has better support for aggregate functions such as COUNT and SUM, and it is easier to save the queries I am discussing and easier to show you the live results.


ANALYZING THE QUERY ABOUT THE UNITED STATES

Let's look at the query in the query box more closely.

<http://www.rdfabout.com/rdf/usgov/geo/us> is our subject.

?p represents any predicate, or column name about the US

?o represents any object or column value.

This is a very common way to initially explore what the data set contains about the subject you are interested in. After you see the values returned, you can begin to restrict the predicates or objects to close in on what you are interested in. This wild card type of querying also helps with copying a workable predicate or object in case your first attempts at removing the PREFIX fail.

FILTER(!isBlank(?o)) . - removes predicates without meaningful information.

FILTER(!regex(str(?o), "zcta")) . removes about 100 different census groups, subdivisions of the United States that are probably only meaningful to the census team.


See the results.

JUST SHOW THE STATES

We'll focus on just the states by specifying the predicate <http://purl.org/dc/terms/hasPart> using the PREFIX representation.

We'll also order by state to make duplicate checking easier.


PREFIX     rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms:  <http://purl.org/dc/terms/>

SELECT * WHERE {
  <http://www.rdfabout.com/rdf/usgov/geo/us>
    dcterms:hasPart
     ?o .
  FILTER(!isBlank(?o)) .
  FILTER(!regex(str(?o), "zcta")) .
}
order by ?o
See the results.

Now we have only states, but do we have all 50 states?

LETS COUNT THE STATES

Counting in SPARQL is a little bit of a pain with the virtuoso browser because you have to feed the count function another, duplicate set of data. In this case, I called the objects ?state when counting them.


PREFIX     geo: 
PREFIX     rdf: 
PREFIX dcterms: 

SELECT ?o, count(distinct ?state) as ?stateCount WHERE 
{{
  <http://www.rdfabout.com/rdf/usgov/geo/us>
    dcterms:hasPart
     ?o .
  FILTER(!isBlank(?o)) .
  FILTER(!regex(str(?o), "zcta")) .
 } UNION {
  <http://www.rdfabout.com/rdf/usgov/geo/us>
    dcterms:hasPart
     ?state .
  FILTER(!isBlank(?state)) .
  FILTER(!regex(str(?state), "zcta")) .
 }
}
order by ?o
See the results.

We got 52 states. Let's add the state name to find the extra records.


WHAT IS THE STATE NAME CALLED?

To find out what the state name is called, we have to explore the state data by getting another set of triplets using the clause,?o ?p2 ?o2. This means, for the same object return all predicates identified as p2 and those detail objects, identified as ?o2.

We also add the new fields ?p2 and ?o2 to our select list, the variables we want to display.


PREFIX     rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms:  <http://purl.org/dc/terms/>

SELECT ?o, count(distinct ?state) as ?stateCount, ?p2, ?o2 WHERE 
{{
  <http://www.rdfabout.com/rdf/usgov/geo/us>
        dcterms:hasPart
           ?o .
       FILTER(!isBlank(?o)) .
       FILTER(!regex(str(?o), "zcta")) .
          ?o
               ?p2
                  ?o2
              .
          
         }
         UNION 
         {
          <http://www.rdfabout.com/rdf/usgov/geo/us>
             dcterms:hasPart
               ?state .
           FILTER(!isBlank(?state)) .
           FILTER(!regex(str(?state), "zcta")) .
         }
        }
        order by ?o
  

See the results.

I search all the records for the string Alaska and find it as the object of the predicate, http://purl.org/dc/elements/1.1/title. That is the predicate for state name.


DISPLAY THE STATE NAME

We will break the state name predicate into 2 pieces and now show just the fields we really want.


PREFIX           rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX       dcterms:  <http://purl.org/dc/terms/>
PREFIX    dcelements:  <http://purl.org/dc/elements/1.1/>

SELECT ?o,?stateName, count(distinct ?state) as ?stateCount WHERE 
{{
  <http://www.rdfabout.com/rdf/usgov/geo/us>
        dcterms:hasPart
           ?o .
       FILTER(!isBlank(?o)) .
       FILTER(!regex(str(?o), "zcta")) .
            ?o
               dcelements:title
                  ?stateName            
             .
         }
         UNION 
         {
          <http://www.rdfabout.com/rdf/usgov/geo/us>
             dcterms:hasPart
               ?state .
           FILTER(!isBlank(?state)) .
           FILTER(!regex(str(?state), "zcta")) .
         }
        }
        order by ?o
  

See the results.

Now I see why we have 52 records listed -- the count record adds 1 to the tally and the District of Columbia appears as a part of the United States.


DISPLAY THE STATE POPULATION

Going back to the same query that revealed the state name, I gathered that the state population was given by
http://www.rdfabout.com/rdf/schema/census/population.
I broke the population predicate into 2 pieces and now also show the population in the select list.


PREFIX        census: <http://www.rdfabout.com/rdf/schema/census/>
PREFIX    dcelements: <http://purl.org/dc/elements/1.1/>
PREFIX       dcterms: <http://purl.org/dc/terms/>
PREFIX           rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>


SELECT ?o, ?stateName,?population, count(distinct ?state) as ?stateCount WHERE 
{{
  <http://www.rdfabout.com/rdf/usgov/geo/us>
        dcterms:hasPart
           ?o .
       FILTER(!isBlank(?o)) .
       FILTER(!regex(str(?o), "zcta")) .
            ?o
               dcelements:title
                  ?stateName;            
               census:population
                  ?population
              .
          
         }
         UNION 
         {
          <http://www.rdfabout.com/rdf/usgov/geo/us>
             dcterms:hasPart
               ?state .
           FILTER(!isBlank(?state)) .
           FILTER(!regex(str(?state), "zcta")) .
         }
        }
        order by ?o
   

See the results.

Now I see population figures for each state, but I'm not sure at what time they were recorded. I would need to check them against other sources as well if I was trying to speak authoritatively about population.


SUM THE STATE POPULATIONS TO CALCULATE THE TOTAL US POPULATION

Now in the select list I'll sum up the individual state population figures that was produced in the 2nd unioned subquery that I had previously used to compute the state count. This produces a grand total population.


PREFIX        census: <http://www.rdfabout.com/rdf/schema/census/>
PREFIX    dcelements: <http://purl.org/dc/elements/1.1/>
PREFIX       dcterms: <http://purl.org/dc/terms/>
PREFIX           rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>


SELECT ?o, ?stateName, 
?population,sum(?statePopulation) as ?totalPopulation,
count(distinct ?state) as ?stateCount WHERE 
{{
  <http://www.rdfabout.com/rdf/usgov/geo/us>
        dcterms:hasPart
           ?o .
       FILTER(!isBlank(?o)) .
       FILTER(!regex(str(?o), "zcta")) .
            ?o
               dcelements:title
                  ?stateName;            
               census:population
                  ?population
              .
          
         }
         UNION 
         {
          <http://www.rdfabout.com/rdf/usgov/geo/us>
             dcterms:hasPart
               ?state .
                ?state
                 census:population
                   ?statePopulation
               .
            
           FILTER(!isBlank(?state)) .
           FILTER(!regex(str(?state), "zcta")) .
         }
        }
        order by ?o
   

See the results.

Now I see total population for the US, but at some unspecified period of time.

Prefixes in SPARQL Queries

While still using the dbpedia Virtuoso browser, http://dbpedia.org/sparql, pointing to a single page about Joan Baez, lets explore the use of prefixes to shorten our links.

Prefixes facilitate communication by helping you use standard predicates and objects to describe your data so the user doesn't have to work so hard to understand you.

SUPPLYING THE RIGHT PREDICATE FOR RDF:TYPE

Viewing the source of the Come from the Shadows page, and looking for the string, rdf:type, I find a simpler URI to use, http://www.w3.org/1999/02/22-rdf-syntax-ns# which works.


SELECT * WHERE 
{?s 
    <http://dbpedia.org/ontology/artist>
       <http://dbpedia.org/resource/Joan_Baez> ;
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type > 
       ?type.
}
See the results.

ABBREVIATE A PREDICATE USING A PREFIX

Instead of using the full URI in the previous example, I can:

1) define a prefix, rdf:,

2) list the URI to identify the prefix enclosed in angled brackets, and then

3) use the prefix to express a URI more compactly, rdf:type

Doing that gives us the same results.


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE 
{?s 
    <http://dbpedia.org/ontology/artist>
       <http://dbpedia.org/resource/Joan_Baez> ;
     rdf:type 
       ?type.
}
See the results.

Using the URI of the prefix, we can go to that web page and learn more about related predicates that might be available in that namespace. These namespaces are important for explaining whether, for example, the chip we are discussing is a computer chip, a potato chip, a paint chip or a cow chip.


OTHER COMMON PREFIX



PREFIX owl: <http://www.w3.org/2002/07/owl#>

Provides metadata about concepts and how they are related to other concepts.


PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

Provides information about how the information is structured and often used to specify the data type of a particular piece of information.


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

Provides metadata about the RDF data such as related concepts, user friendly labels, comments about the information.


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

Provides very fundamental classification of the information.


PREFIX foaf: <http://xmlns.com/foaf/0.1/>

Provides information about people and their inter-relationships (friend of a friend), and how to contact them.


PREFIX dc: <http://purl.org/dc/elements/1.1/>

Provides the Dublic Core items of metadata. It is often used to describe published information and the publisher of that information.


PREFIX : <http://dbpedia.org/resource/>

With this empty prefix, we could then refer to the dbpedia resource <http://dbpedia.org/resource/Joan_ Baez> simply as :Joan_Baez


PREFIX dbpedia2: <http://dbpedia.org/property/>

This describes the properties that Dbpedia uses to describe things or subjects it has information about. These properties are appropriate to the thing, so a river would have a different set of pre-defined properties than a mathemetician.


PREFIX dbpedia: <http://dbpedia.org/>

Provides information coming from dbpedia..


PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

Simple Knowledge Organization System (SKOS) is a common data model for sharing and linking knowledge organization systems via the Semantic Web.


PREFIX dbo: <http://dbpedia.org/ontology/>

Provides metadata about what entities exist, how they can be grouped, how they related within a hierarchy, and how they can be subdivided according to similarities and differences.

Thursday, August 26, 2010

Dbpedia Reverse Links - is dbpedia-owl:artist of

While still using the dbpedia Virtuoso browser, http://dbpedia.org/sparql, pointing to a single page about Joan Baez, lets explore what it takes to retrieve data from reverse links, in this case is dbpedia-owl:artist of.

When dbpedia shows is somePredicate of, then the subject of the current page, in this case Joan Baez, becomes the object on another page, while the values to the right become the subjects of those other pages. It will become clearer when you examine the results of the following query.


In the Joan Baez Page:

We see the predicate, is dbpedia-owl:artist of to the left, and a number of other dbpedia links listed as the object to the right. I'm showing just the first few entries because Joan Baez is quite a prolific artist.


In the   Come From the Shadows   Page:

I clicked on the first object link from the Joan Baez page, dbpedia:Come_from_the_Shadows to see what it was.

We see the predicate, dbpedia-owl:artist to the left, and dbpedia:Joan Baez listed as the object to the right.



JOAN BAEZ AS THE OBJECT INSTEAD OF THE SUBJECT

SELECT * WHERE 
{?s 
    <http://dbpedia.org/ontology/artist>
       <http://dbpedia.org/resource/Joan_Baez>
 .
}

This query asks SPARQL TO list all subject, ?s, where the artist is Joan Baez.


RUN QUERY


See the results.

WHAT DID WE RETRIEVE?

It looks like these are her albums, but let's make sure. Lets get the rdf:type of each subject. I copied the link for rdf:type as I have done all the other links so far. This time it caused some problems.


SELECT * WHERE 
{?s 
    <http://dbpedia.org/ontology/artist>
       <http://dbpedia.org/resource/Joan_Baez>;
    <http://dbpedia.org/about/html/http://www.w3.org/1999/02/22-rdf-syntax-ns%01type>
       ?type.
}

RUN QUERY

See the results.

The dreaded blank column headers doesn't tell me what's wrong, but the long predicate picked up from the link rdf:type looks suspicious. <http://dbpedia.org/about/html/http://www.w3.org/1999/02/22-rdf-syntax-ns%01type> is very long and different from our other links that worked.

SUPPLYING THE RIGHT PREDICATE FOR RDF:TYPE

Viewing the source of the Come from the Shadows page, and looking for the string, rdf:type, I find a simpler URI to use, http://www.w3.org/1999/02/22-rdf-syntax-ns# which works.


SELECT * WHERE 
{?s 
    <http://dbpedia.org/ontology/artist>
       <http://dbpedia.org/resource/Joan_Baez> ;
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type > 
       ?type.
}
See the results.

IT WORKED BUT EACH ALBUM IS REPEATED MULTIPLE TIMES

The last set of results list Come from the Shadows 4 times because it's page gave it 4 different types: an owl thing, a MusicalWork, a Work and an Album. Most of the albums appear multiple times as well.


LIST EACH ALBUM ONCE

I looked through the Come from the Shadows page to find a description of what it was that was a single entry. I found the predicate, http://dbpedia.org/ontology/type. Let's try that as our predicate and sort by subject to see if we did indeed list each album only once.


SELECT * WHERE 
{?s 
    <http://dbpedia.org/ontology/artist>
       <http://dbpedia.org/resource/Joan_Baez> ;
    <http://dbpedia.org/ontology/type > 
       ?type.
}
order by ?s
See the results.

Yes, mission accomplished! We have each album listed only once, and dbpedia has classified the albums as Album, Compilation_album, Live_album or a Studio_album.

Wednesday, August 25, 2010

Processing Dates - When was Joan Baez born

While still using the dbpedia Virtuoso browser, http://dbpedia.org/sparql, pointing to a single page about Joan Baez, lets explore what it takes to retrieve date information.

WHEN WAS SHE BORN AND WHEN SHE BECAME ACTIVE

SELECT * WHERE 
{?s 
    <http://dbpedia.org/ontology/birthDate>
      ?birthDate;
    <http://dbpedia.org/ontology/activeYearsStartYear>
      ?firstActive.
}

The subject ?s will point back to this page.

The first predicate was the dbpedia-owl:birthDate link copied from the page.

The first object ?birthDate will be the info returned from the query.

It is followed by a semicolon instead of a period meaning repeat the same subject for the next predicate - object pair. I use indentation to remind myself that I am seeking info about the same subject, but the indentation is not significant to the SPARQL engine.

The second predicate, was the dbpedia-owl:ActiveYearsStartYear link copied from the page.

The second object, will be when she became active musically, presumably when she appeared on the world's stage, rather than the first time she sang or played a guitar.

RUN QUERY


See the results.

FILTERING DATES

Your first attempt to filter by a date in the form of yyyy-mm-dd will fail - giving you the dreaded empty column headers feedback.


SELECT * WHERE 
{?s 
    <http://dbpedia.org/ontology/birthDate>
      ?birthdate;
    <http://dbpedia.org/ontology/activeYearsStartYear>
      ?firstActive.
 FILTER(?birthdate > "1940-01-01").
}

RUN QUERY


See the results.


FILTERING DATES USING DATETIME TYPE

Looking at the SPARQL manual on 11 Testing Values you see that you can't match a dateTime against a simple string, but rather you must match against a dateTime string as indicated by the 2 circumflex characters ^^ then the prefix xsd: and finally the datatype, dateTime.

By matching a dateTime to a dateTime, you will not see your results.


SELECT * WHERE 
{?s 
    <http://dbpedia.org/ontology/birthDate>
      ?birthdate;
    <http://dbpedia.org/ontology/activeYearsStartYear>
      ?firstActive.
 FILTER(?birthdate > "1940-01-01"^^xsd:dateTime).
}

RUN QUERY


See the results.

COMPARING STRINGS TO STRINGS

The same results would be obtained with the following filter condition:

FILTER(str(?birthdate) > "1940-01-01"). because you are first converting the ?birthdate to a string before comparing it to another string.

The str function is also handy for changing URI's (or web identifiers) into strings for simpler filtering.

Tuesday, August 24, 2010

Retrieving Other Info From a Dbpedia Page

While still using the dbpedia Virtuoso browser, http://dbpedia.org/sparql, pointing to a single page about Joan Baez, lets explore what it takes to retrieve different types of data.

DISPLAYING THE ABSTRACT


The subject ?s will point back to this page.

The predicate was the dbpedia-owl:abstract link copied from the page.

The object ?o will be the abstract returned from the query.

RUN QUERY


See the results.

GET ONLY THE ENGLISH ABSTRACTS

The last query returned abstracts in many different languages. To return just the English version, we modify the Sparql query as follows:
select distinct * 
where {?s <http://dbpedia.org/ontology/abstract> ?o .
 FILTER(langMatches(lang(?o),"en")).
}

The filter condition is how we specify the language tag associated with English,
abbreviated "en". The built-in function, langMatches provides the capability to detect the language associated with a column.

RUN QUERY


See the results.

GET ONLY THE ENGLISH ABSTRACTS OR ABSTRACTS AS SIMPLE STRINGS

Sometimes you might have no language tag at all, just a simple text string. To pick that up, you need the UNION operator to handle this possibility.

Note:The clause describing the abstracts with the @en label, for English, is enclosed in braces {} followed by the keyword UNION and then braces must enclose the clause for label-less abstracts.

select distinct * 
where 
{
{
  ?s <http://dbpedia.org/ontology/abstract> ?o .
  FILTER(langMatches(lang(?o),"en")).
 
 }
 UNION
 {
  ?s <http://dbpedia.org/ontology/abstract> ?o .
  FILTER(!langMatches(lang(?o),"*")).
 }
} 

The second filter condition is how we specify a string without a language tag, or rather, it does not match the object that has a language value of *, meaning any language.

In this case it doesn't make a difference because all abstracts include a language tag.

RUN QUERY


See the results.

Monday, August 23, 2010

Querying a Single DBPedia Page - for Joan Baez

Still using the dbpedia Virtuoso browser, http://dbpedia.org/sparql, I applied the last query to a single dbpedia page by specifying the default graph uri as http://dbpedia.org/resource/Joan_Baez. This will show the various ways dbpedia classifies Joan Baez.

Virtuoso SPARQL Query Form


RUN QUERY


See the results.

COMPARE TO THE DBPEDIA PAGE FOR JOAN BAEZ

When starting with SPARQL, I often pored over the dbpedia resource file in order to compare it with my SPARQL output to see if I got the SPARQL code right. These results match the entries under rdf:type (the predicate which was abbreviated as a).

Note: I did not specify the Joan Baez human readable page, http://dbpedia.org/page/Joan_Baez, in the Default Graph URI.

Instead, I specified the RDF page, http://dbpedia.org/resource/Joan_Baez" which is designed for SPARQL querying. However, if you point your browser to to resource page, dbpedia will show you the human readable page. The Sparql query engine however can go straight to the underlying RDF.

Tuesday, August 17, 2010

An Online SPARQL Query Form

I must apologize to my reader(s) about hogging all the Sparql fun. Here I've been querying and revising queries and not even given you a bit of help on running your own Sparql queries. Forgive me.

Virtuoso SPARQL Query Form

 The link for the website is http://dbpedia.org/sparql



I like it for querying dbpedia because of it's immediacy. I put in the Sparql query and get out my answer, nothing to install on my machine, sweet and simple SPARQL power over the web.

Default Graph URI

The default graph textbox allows you to tell SPARQL what information source to query when you do not explicitly name a source in your query. The Virtuoso browser prefills the field with http://dbpedia.org meaning query all of dbpedia.

THE QUERY TEXT

The query text box is where you enter your SPARQL query. I generally open a second window pointing to the Virtuoso query, then I can copy the SPARQL queries from this blog or any other SPARQL learning resource, put it in this box, and run it. That way I can not only see the queries in action, but can tweak them and explore a little further on my own.

select distinct ?Concept where { [] a ?Concept }

This says: Show me the distinct Concepts in http://dbpedia.org.

The symbols [] indicates I don't care what the subject is.

The predicate a is shorthand for rdf:type. I translate it as "is a" .

The object is called a ?Concept.


DISPLAY RESULTS AS

The Virtuoso browser provides many different output formats. I use HTML most of the time. If you write a query that outputs only a few columns and a few records, then running the same query in each of the different formats is a very clear way to appreciate the differences between each output format.


RIGOROUS CHECK OF THE QUERY

The rigorous check checkbox probably provides more feedback about less desirable SPARQL, but unchecking it won't help you run a query with fatal errors. I personally find the SPARQL compiler's error feedback less helpful than the feedback from Oracle's sql compiler, but Oracle has been around a lot longer than SPARQL.


EXECUTION TIMEOUT IN MILLISECONDS

I haven't used it.


RUN QUERY

When you click the Run Query button you can see the output of your query in the format your specified, in this example, HTML. Such a raw dump of all the distinct concepts in dbpedia shows an enormous number of concepts that appear unintelligible. Farther down the list you see concepts that make sense even at a casual glance.

See the results.

Monday, August 16, 2010

All Simpson's Blackboard gags - Filtered and Ordered

FILTER AND ORDER THE RESULTS

There are extra rows showing because a season sometimes has more than 1 label assigned to it. We now want to FILTER the rows to just those containing The Simpsons episodes in the season title.

THE SPARQL QUERY - FILTERED AND ORDERED

SELECT distinct ?season_title,?episode,?chalkboard_gag
WHERE 
{
 ?episode
   <http://www.w3.org/2004/02/skos/core#subject>
     ?season
 .
 ?episode
   <http://dbpedia.org/property/blackboard>
     ?chalkboard_gag
 .
     ?season
       <http://www.w3.org/2000/01/rdf-schema#label>
         ?season_title
     .
    FILTER(regex(?season_title,"The Simpsons episodes","i")).
}
ORDER BY ?season_title
Show results

FILTER WITH REGULAR EXPRESSIONS

13)FILTER(regex(?season_title,"The Simpsons episodes","i")).

  • Line 13) FILTER lets us restrict the rows in our output to only those that make our expression true. These are similar to Oracle's where clauses. In this example we used the regex function in a very simple way to search ?season_title for the string, The Simpsons episodes, ignoring any variations in uppercase or lower case letters.

ORDER THE RESULTS

15) ORDER BY ?season_title

Line 15) We would like to order by season title, so we see the gags over time. This command orders the title as a string, at least keeping the seasons together, but putting season 1 and season 11 before season 2.
We could work harder to isolate the number at the end of the season and sort on its
numeric representation, but let's not.
Some episodes have an airdate, the first time the episode was shown, but the data item is too sparse to order all our results. Let's move on.

All Simpson's Blackboard gags - With Season Title

DISPLAY THE SEASON LABEL







THE QUERY DIAGRAM


THE SPARQL QUERY - WITH THE SEASON TITLE

SELECT distinct ?season_title,?episode,?chalkboard_gag
WHERE 
{
 ?episode
   <http://www.w3.org/2004/02/skos/core#subject>
     ?season
 .
 ?episode
   <http://dbpedia.org/property/blackboard>
     ?chalkboard_gag
 .
   ?season
       <http://www.w3.org/2000/01/rdf-schema#label>
         ?season_title
     .
}
Show results

PICK UP THE SEASON TITLE

12)?season
13) <http://www.w3.org/2000/01/rdf-schema#label>
14) ?season_title
15).


Line 12) For each season
Line 13) return the it's user-friendly label
Line 14) and call it ?season_title.