Friday, August 27, 2010

Counting and Summing - US States and their populations

Lets switch to the Census Bureau's SPARQL test page,http://www.rdfabout.com/demo/census/sparql.xpd, and find out some information about the 50 states of the United States.

It always helps to have the sparql engine give some guidance, or some starting queries. Still the bulk of your SPARQL queries will come from self-discovery of the subjects, predicates and the objects in the database.

Note: After playing a little with the Census Bureau's SPARQL test page, I switched back to the dbpedia Virtuoso browser, http://dbpedia.org/sparql because its has better support for aggregate functions such as COUNT and SUM, and it is easier to save the queries I am discussing and easier to show you the live results.


ANALYZING THE QUERY ABOUT THE UNITED STATES

Let's look at the query in the query box more closely.

<http://www.rdfabout.com/rdf/usgov/geo/us> is our subject.

?p represents any predicate, or column name about the US

?o represents any object or column value.

This is a very common way to initially explore what the data set contains about the subject you are interested in. After you see the values returned, you can begin to restrict the predicates or objects to close in on what you are interested in. This wild card type of querying also helps with copying a workable predicate or object in case your first attempts at removing the PREFIX fail.

FILTER(!isBlank(?o)) . - removes predicates without meaningful information.

FILTER(!regex(str(?o), "zcta")) . removes about 100 different census groups, subdivisions of the United States that are probably only meaningful to the census team.


See the results.

JUST SHOW THE STATES

We'll focus on just the states by specifying the predicate <http://purl.org/dc/terms/hasPart> using the PREFIX representation.

We'll also order by state to make duplicate checking easier.


PREFIX     rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms:  <http://purl.org/dc/terms/>

SELECT * WHERE {
  <http://www.rdfabout.com/rdf/usgov/geo/us>
    dcterms:hasPart
     ?o .
  FILTER(!isBlank(?o)) .
  FILTER(!regex(str(?o), "zcta")) .
}
order by ?o
See the results.

Now we have only states, but do we have all 50 states?

LETS COUNT THE STATES

Counting in SPARQL is a little bit of a pain with the virtuoso browser because you have to feed the count function another, duplicate set of data. In this case, I called the objects ?state when counting them.


PREFIX     geo: 
PREFIX     rdf: 
PREFIX dcterms: 

SELECT ?o, count(distinct ?state) as ?stateCount WHERE 
{{
  <http://www.rdfabout.com/rdf/usgov/geo/us>
    dcterms:hasPart
     ?o .
  FILTER(!isBlank(?o)) .
  FILTER(!regex(str(?o), "zcta")) .
 } UNION {
  <http://www.rdfabout.com/rdf/usgov/geo/us>
    dcterms:hasPart
     ?state .
  FILTER(!isBlank(?state)) .
  FILTER(!regex(str(?state), "zcta")) .
 }
}
order by ?o
See the results.

We got 52 states. Let's add the state name to find the extra records.


WHAT IS THE STATE NAME CALLED?

To find out what the state name is called, we have to explore the state data by getting another set of triplets using the clause,?o ?p2 ?o2. This means, for the same object return all predicates identified as p2 and those detail objects, identified as ?o2.

We also add the new fields ?p2 and ?o2 to our select list, the variables we want to display.


PREFIX     rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms:  <http://purl.org/dc/terms/>

SELECT ?o, count(distinct ?state) as ?stateCount, ?p2, ?o2 WHERE 
{{
  <http://www.rdfabout.com/rdf/usgov/geo/us>
        dcterms:hasPart
           ?o .
       FILTER(!isBlank(?o)) .
       FILTER(!regex(str(?o), "zcta")) .
          ?o
               ?p2
                  ?o2
              .
          
         }
         UNION 
         {
          <http://www.rdfabout.com/rdf/usgov/geo/us>
             dcterms:hasPart
               ?state .
           FILTER(!isBlank(?state)) .
           FILTER(!regex(str(?state), "zcta")) .
         }
        }
        order by ?o
  

See the results.

I search all the records for the string Alaska and find it as the object of the predicate, http://purl.org/dc/elements/1.1/title. That is the predicate for state name.


DISPLAY THE STATE NAME

We will break the state name predicate into 2 pieces and now show just the fields we really want.


PREFIX           rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX       dcterms:  <http://purl.org/dc/terms/>
PREFIX    dcelements:  <http://purl.org/dc/elements/1.1/>

SELECT ?o,?stateName, count(distinct ?state) as ?stateCount WHERE 
{{
  <http://www.rdfabout.com/rdf/usgov/geo/us>
        dcterms:hasPart
           ?o .
       FILTER(!isBlank(?o)) .
       FILTER(!regex(str(?o), "zcta")) .
            ?o
               dcelements:title
                  ?stateName            
             .
         }
         UNION 
         {
          <http://www.rdfabout.com/rdf/usgov/geo/us>
             dcterms:hasPart
               ?state .
           FILTER(!isBlank(?state)) .
           FILTER(!regex(str(?state), "zcta")) .
         }
        }
        order by ?o
  

See the results.

Now I see why we have 52 records listed -- the count record adds 1 to the tally and the District of Columbia appears as a part of the United States.


DISPLAY THE STATE POPULATION

Going back to the same query that revealed the state name, I gathered that the state population was given by
http://www.rdfabout.com/rdf/schema/census/population.
I broke the population predicate into 2 pieces and now also show the population in the select list.


PREFIX        census: <http://www.rdfabout.com/rdf/schema/census/>
PREFIX    dcelements: <http://purl.org/dc/elements/1.1/>
PREFIX       dcterms: <http://purl.org/dc/terms/>
PREFIX           rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>


SELECT ?o, ?stateName,?population, count(distinct ?state) as ?stateCount WHERE 
{{
  <http://www.rdfabout.com/rdf/usgov/geo/us>
        dcterms:hasPart
           ?o .
       FILTER(!isBlank(?o)) .
       FILTER(!regex(str(?o), "zcta")) .
            ?o
               dcelements:title
                  ?stateName;            
               census:population
                  ?population
              .
          
         }
         UNION 
         {
          <http://www.rdfabout.com/rdf/usgov/geo/us>
             dcterms:hasPart
               ?state .
           FILTER(!isBlank(?state)) .
           FILTER(!regex(str(?state), "zcta")) .
         }
        }
        order by ?o
   

See the results.

Now I see population figures for each state, but I'm not sure at what time they were recorded. I would need to check them against other sources as well if I was trying to speak authoritatively about population.


SUM THE STATE POPULATIONS TO CALCULATE THE TOTAL US POPULATION

Now in the select list I'll sum up the individual state population figures that was produced in the 2nd unioned subquery that I had previously used to compute the state count. This produces a grand total population.


PREFIX        census: <http://www.rdfabout.com/rdf/schema/census/>
PREFIX    dcelements: <http://purl.org/dc/elements/1.1/>
PREFIX       dcterms: <http://purl.org/dc/terms/>
PREFIX           rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>


SELECT ?o, ?stateName, 
?population,sum(?statePopulation) as ?totalPopulation,
count(distinct ?state) as ?stateCount WHERE 
{{
  <http://www.rdfabout.com/rdf/usgov/geo/us>
        dcterms:hasPart
           ?o .
       FILTER(!isBlank(?o)) .
       FILTER(!regex(str(?o), "zcta")) .
            ?o
               dcelements:title
                  ?stateName;            
               census:population
                  ?population
              .
          
         }
         UNION 
         {
          <http://www.rdfabout.com/rdf/usgov/geo/us>
             dcterms:hasPart
               ?state .
                ?state
                 census:population
                   ?statePopulation
               .
            
           FILTER(!isBlank(?state)) .
           FILTER(!regex(str(?state), "zcta")) .
         }
        }
        order by ?o
   

See the results.

Now I see total population for the US, but at some unspecified period of time.

No comments:

Post a Comment