Friday, July 30, 2010

All Simpsons Blackboard gags

Let's expand our second query and get all the blackboard gags from all the Simpsons seasons and episodes.

THE QUERY DIAGRAM


THE SPARQL QUERY - ANY EPISODE ANY SEASON WITH A BLACKBOARD GAG

SELECT distinct ?episode,?chalkboard_gag
WHERE 
{
 ?episode
   <http://www.w3.org/2004/02/skos/core#subject>
     ?season
 .
 ?episode
   <http://dbpedia.org/property/blackboard>
     ?chalkboard_gag
 .
}
Show results.

GET ANY EPISODE - ANY SEASON

4) ?episode  
5)  <http://www.w3.org/2004/02/skos/core#subject>  
6)   ?season  
7) .
 
Note: I'll only discuss the newest items of interest instead of endlessly repeating that the select clause lists the variables you want to see.
  • Line 4) The subject we are talking about.
  • Line 5) We already learned that the skos:subject referred to the season this episode belonged to.
  • Line 6) Instead of limiting our results to The Simpsons season 12, we will put in ?season, meaning any season.
  • Line 7) The period marks the end of the first selection clause.
  • Recapping, lines 4-7 ask the SPARQL engine to search all of dbpedia to find the page(s) that I will call ?episode that have a skos:subject value of anything. We will call that anything a season. We will have to count on the 2nd clause to restrict our results to ?episodes that have a blackboard property holding the blackboard gag.
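
Since skos:subject can point at any category, not only a season, one way to check what ?season actually matched is to add it to the output. Here is a sketch, assuming the same dbpedia data as the query above. The regex text is an assumption about how the season category addresses are spelled, based on the season 12 address used in the earlier post. Note that the variables in the select clause are separated by spaces here, which is the form the SPARQL specification describes.

```sparql
SELECT DISTINCT ?season ?episode ?chalkboard_gag
WHERE
{
 ?episode
   <http://www.w3.org/2004/02/skos/core#subject>
     ?season
 .
 ?episode
   <http://dbpedia.org/property/blackboard>
     ?chalkboard_gag
 .
 # Assumption: every season category address contains this text.
 FILTER regex(str(?season), "The_Simpsons_episodes")
}
ORDER BY ?season ?episode
```

Seeing the ?season column next to each gag makes it easier to spot rows that matched some other category by accident.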

LET'S CHECK OUR RESULTS

The results look like Simpsons blackboard gags, but let's make sure.

Thursday, July 29, 2010

The Simpsons Blackboard gags for Season 12

Let's expand your first query and get all the blackboard gags from the 12th Simpsons season.

STARTING FROM ONE EPISODE

  • We'll start with the first sparql query's dbpedia page, the dbpedia version of Worst Episode Ever.
  • Here are the dbpedia page's sections identifying the episode, the blackboard gag for this episode, and lastly, the season this episode is part of. We will use this information for our query.




THE QUERY DIAGRAM

THE SPARQL QUERY - SEASON 12 EPISODES WITH BLACKBOARD GAGS

SELECT distinct ?episode,?chalkboard_gag
WHERE 
{
 ?episode
   <http://www.w3.org/2004/02/skos/core#subject>
     <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12>
 .
 ?episode
   <http://dbpedia.org/property/blackboard>
     ?chalkboard_gag
 .
}
Show results.

SPECIFY YOUR OUTPUT VARIABLES

1) SELECT distinct ?episode,?chalkboard_gag
  • The select clause specifies what fields you want in your output.
  • The distinct keyword removes duplicate values of ?episode - ?chalkboard_gag combinations.
  • I specified the same two variables, ?episode and ?chalkboard_gag.

START OF YOUR SELECTION CRITERIA

2) WHERE { . . . } Enough said.

IDENTIFY THE EPISODES IN SEASON 12







 
4) ?episode  the subject
5)  <http://www.w3.org/2004/02/skos/core#subject>  the predicate
6)   <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12>  the object
7) .
 
  • Line 4) The subject we are talking about.
  • Line 5) At first sight, skos:subject didn't mean much to me, but when I saw its value referred to The Simpsons season 12, I figured it had to describe the season that this episode belonged to. I copied the link's value into my sparql query.
  • Line 6) For the object, I picked up the value of the long link to the right of skos:subject.
  • Line 7) The period marks the end of the first selection clause.
  • Recapping, lines 4-7 ask the SPARQL engine to search all of dbpedia to find the page(s) that I will call ?episode that have a skos:subject value of http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12.

GET THE BLACKBOARD GAG


8) ?episode  the subject 
9)  <http://dbpedia.org/property/blackboard>  the predicate 
10)   ?chalkboard_gag  the object 
11) .

  • Line 8) By using the same variable, ?episode as subject in lines 4 and 8, I am telling SPARQL that I want to see the blackboard gag for the episode defined above.
  • Line 9) I copied the link, dbprop:blackboard, from the dbpedia page as mentioned before.
  • Line 10) I specified the object as a variable called ?chalkboard_gag, meaning pick it up from the page (or really from the RDF data underlying the page).

    This will now show different blackboard gags for different episodes.
  • Line 11) End of the 2nd search clause.
  • Line 12) The closing bracket ends your selection criteria.
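
Because both clauses share the subject ?episode, SPARQL also lets you write the subject once and separate the predicate-object pairs with a semicolon. Here is a sketch of the same season 12 query in that shorthand. It is meant to be equivalent to the query above, not a replacement for it.

```sparql
SELECT DISTINCT ?episode ?chalkboard_gag
WHERE
{
 ?episode
   <http://www.w3.org/2004/02/skos/core#subject>
     <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12>
 ;                                  # semicolon: the next pair reuses ?episode
   <http://dbpedia.org/property/blackboard>
     ?chalkboard_gag
 .
}
```

The long form repeats ?episode, which makes the join explicit. The semicolon form is shorter once you are comfortable reading triples.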

LET'S CHECK OUR RESULTS

So that's your second query. Let's compare our results against the object link to the Simpsons episodes of season 12.

We have a little problem. Our results show 7 gags, but season 12 had 21 episodes. What gives?
Not all the episodes had blackboard gags, so let's ask for each episode in season 12 and then optionally show the blackboard gag if it exists.

SPARQL QUERY WITH THE BLACKBOARD GAG OPTIONAL

SELECT distinct ?episode,?chalkboard_gag
WHERE 
{
 ?episode
   <http://www.w3.org/2004/02/skos/core#subject>
     <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12>
 .
 OPTIONAL
 { 
 ?episode
   <http://dbpedia.org/property/blackboard>
     ?chalkboard_gag
 .
 } # end of optional blackboard gag
} # end of where clause

Show results
Now there are 21 episodes and still the same 7 blackboard gags. That clears up the mystery.
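
The OPTIONAL clause leaves ?chalkboard_gag unbound for episodes that have no gag recorded. Here is a sketch that uses this to list only those episodes, by keeping the rows where the variable was never filled in. It assumes the same season 12 category address as above and should account for the gap between the 21 episodes and the 7 gags.

```sparql
SELECT DISTINCT ?episode
WHERE
{
 ?episode
   <http://www.w3.org/2004/02/skos/core#subject>
     <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12>
 .
 OPTIONAL
 {
 ?episode
   <http://dbpedia.org/property/blackboard>
     ?chalkboard_gag
 .
 } # end of optional blackboard gag
 # Keep only the rows where the optional part found nothing.
 FILTER (!bound(?chalkboard_gag))
} # end of where clause
```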

Thursday, July 22, 2010

Your first Sparql Query

Let's jump right in and make your first query. We will get the blackboard gag from a single Simpsons episode, Worst Episode Ever.

FIND THE RDF DATA TO QUERY

  • I search Google for "simpson episodes complete list".
  • I find a page with a fairly comprehensive list of seasons. I select season 12.
  • From the page for season 12, I pick a catchy title, Worst Episode Ever. This returns a Wikipedia article about the individual episode I am interested in, but the wiki page is designed for people to read and enjoy.
  • I want to find the dbpedia version of the page to make my Sparql querying simpler. I query Google for dbpedia + "Worst Episode Ever" and find the dbpedia version of this page.
  • Here is the dbpedia page's abstract, the section identifying the episode and the blackboard gag for this episode. We will use this information for our query.









THE QUERY DIAGRAM

 

THE SPARQL QUERY - 1 EPISODE'S BLACKBOARD GAG

SELECT distinct ?episode,?chalkboard_gag
WHERE 
{
 ?episode
   <http://xmlns.com/foaf/0.1/page>
     <http://en.wikipedia.org/wiki/Worst_Episode_Ever>
 .
 ?episode
   <http://dbpedia.org/property/blackboard>
     ?chalkboard_gag
 .
}

Show results.


SPECIFY YOUR OUTPUT VARIABLES

1) SELECT distinct ?episode,?chalkboard_gag
  • The select clause specifies what fields you want in your output.
  • The distinct keyword removes duplicate values of ?episode - ?chalkboard_gag combinations. I usually use distinct to get the output as compact as possible in terms of the number of rows returned.
  • Variables are indicated by a question mark before an identifier that you specify. My two variables are ?episode and ?chalkboard_gag, although I could have called them something else.

START OF YOUR SELECTION CRITERIA

2) WHERE { . . . }

Within the brackets is the specification of which rows out of the potentially billions of rows in the RDF datastore you want to see at this moment.

IDENTIFY THE EPISODE

4) ?episode  the subject
5)  <http://xmlns.com/foaf/0.1/page>  the predicate
6)   <http://en.wikipedia.org/wiki/Worst_Episode_Ever>  the object
7) .

  • Line 4) As in English, the subject must be established so we know what we are talking about. We make the subject a variable by using a leading question mark.
  • Line 5) If you examine the dbpedia page, there are several ways to identify this episode, something akin to Oracle's primary key. I chose to identify the record by its foaf:page.








    The link foaf:page is my predicate and I need to capture its web address.
    In Firefox I simply right click the link and select Copy link location.
    In Internet Explorer, I right click the link, select properties, and then copy the link from the address (URL) field.
    This gives me http://xmlns.com/foaf/0.1/page. I enclose it in angled brackets, making it <http://xmlns.com/foaf/0.1/page>. Now I have a sparql predicate.
  • Line 6) To pick up the object, or value of foaf:page, I do the same to the link on the right. Enclosing it in angled brackets gives me:

    <http://en.wikipedia.org/wiki/Worst_Episode_Ever>
  • Line 7) The period marks the end of the first selection clause. It doesn't have to be on a separate line, but I like to use it on its own line to better separate each clause, something important to me as the complexity of queries grows.
  • Recapping, lines 4-7 ask the SPARQL engine to search all of dbpedia to find the page(s) that I will call ?episode that have a foaf:page value of http://en.wikipedia.org/wiki/Worst_Episode_Ever.

GET THE BLACKBOARD GAG


8) ?episode  the subject 
9)  <http://dbpedia.org/property/blackboard>  the predicate 
10)   ?chalkboard_gag  the object 
11) .

  • Line 8) By using the same variable, ?episode as subject in lines 4 and 8, I am telling SPARQL that I want to see the blackboard gag for the episode defined above.
  • Line 9) I copied the link, dbprop:blackboard, from the dbpedia page as mentioned before.
  • Line 10) I specified the object as a variable called ?chalkboard_gag, meaning pick it up from the page (or really from the RDF data underlying the page).
    This will show: "I will not hide the teacher's medicine."
  • Line 11) End of the 2nd search clause.
  • Line 12) The closing bracket ends your selection criteria.
So that's your first query. Complicated at first, but as you do more and more, it will become second nature to you.
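
The short names foaf:page and dbprop:blackboard shown on the dbpedia page can be used directly in a query if you declare what each prefix stands for. Here is a sketch of the same query with PREFIX declarations. The prefix labels are simply chosen here to match the page; only the addresses they expand to matter.

```sparql
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
PREFIX dbprop: <http://dbpedia.org/property/>

SELECT DISTINCT ?episode ?chalkboard_gag
WHERE
{
 # foaf:page expands to <http://xmlns.com/foaf/0.1/page>
 ?episode  foaf:page          <http://en.wikipedia.org/wiki/Worst_Episode_Ever> .
 # dbprop:blackboard expands to <http://dbpedia.org/property/blackboard>
 ?episode  dbprop:blackboard  ?chalkboard_gag .
}
```

The full addresses in angled brackets and the prefixed names mean the same thing to the SPARQL engine, so use whichever you find easier to read.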

Wednesday, July 7, 2010

What is the RDF data format

RDF stands for Resource Description Framework. It is a way of representing information so that related facts can be easily combined.
For the technically inclined, see the RDF Primer.

THE RELATIONAL APPROACH

In a typical database, you have tables that collect related information about a particular thing, e.g. an employee. The table has columns such as id, name, age, etc., and each of those columns has values and usually a type. The id might be a 4 digit number, the name could be a 40 character string, and the age a 3 digit number.

A RELATIONAL EMPLOYEE TABLE

Id   Name            Age
1    Bill Townsend   47
2    Mary Maxwell    33

The number of columns in a table varies from 1 to several hundred.
In this employee table, 1 row of the table stores all the various items about a single employee.

 

THE RDF APPROACH

Right now I am oversimplifying, but in the RDF model, the data would be stored something like this.

EMPLOYEE INFORMATION IN RDF FORM
?subject                                     ?predicate   ?object
<http://www.BillTownsend.com/me>             name         "Bill Townsend"
<http://www.BillTownsend.com/me>             age          47
<http://www.bestandbrightest.com/mmaxwell>   name         "Mary Maxwell"
<http://www.bestandbrightest.com/mmaxwell>   age          33

The basic rdf "table" structure will always have these same 3 columns.
In the RDF model, 1 row stores the subject and 1 item of information about that subject. It will take many rows, or triples (subject, predicate, object combinations), to fully describe the subject.


The subject is a web address that the whole world could use to uniquely identify this person. Bill might have created that web page for his resume, but now that web address could be used by anyone to record information about him.


The predicates are comparable to the relational database world's column names. It's best to use predicates that are already known to the RDF/SPARQL community so people know what you are talking about.


The objects correspond to the database column values. They could be literals or other web addresses that become the subject of additional predicates and objects.
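
To connect this to the Simpsons queries in the other posts, here is a sketch of how the employee rows could be retrieved with SPARQL. It assumes the triples above were loaded into some RDF store, and that the bare name and age predicates were spelled out as foaf:name and foaf:age from the FOAF vocabulary. Both are assumptions made for illustration, not something the table specifies.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ?name ?age
WHERE
{
 ?person  foaf:name  ?name .   # one triple holds the name
 ?person  foaf:age   ?age .    # another triple holds the age
}
```

Each result row stitches two triples about the same ?person back into something resembling one row of the relational employee table.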

 

Graphical Representation of RDF Data

In my time grappling with Sparql and RDF, I've found it most helpful to plan my queries visually, and even to analyze queries visually. I will describe the convention I will use in this blog.

The subjects for my little diagrams will be rounded, because they must be web addresses.

The predicates will be labels on the arrows.

The objects will be square if they are literals and will be rounded if they are a web address, which can have other information hanging off it.


Thursday, July 1, 2010

What is the semantic web

The semantic web consists of web pages that are organized so that they not only make sense to people reading them, but allow software to easily pull information out of them.

I give thanks to the weblog of Bob DuCharme which showed how much fun Sparql could be. He showed a query used to get all the blackboard gags that Bart wrote on the blackboard during season 12. My first queries brought me a lot of laughs as I began to retrieve joke after joke without visiting each and every web page describing each episode.

I deal with government fishery databases every work day. I never realized that the developers of The Simpsons television series were methodically recording various details about each show in Wikipedia and that these descriptions of each episode could be processed as if they were records in a database.  This is like the metadata that computer programmers are supposed to write to document their programs.

What is Linked Data from Internet pioneer Sir Tim Berners-Lee

Here is Sir Tim Berners-Lee on this at TED:


Michael Hausenblas's introductions:

The purpose of my Sparql Playground

I'd like to share my exploration of the Sparql query language with fellow seekers trying to treat the entire internet as one giant database.

I am very impressed by the giant brains who have developed the concepts and software that make it possible to query dbpedia for example to find out about countries or artists or the Bart Simpson television series, but often the presentation of the material is so complex, it makes my brain ache.

Sometimes someone explains a concept with such simple elegance, I grasp it easily and gratefully. This blog is my attempt to share my own small discoveries about the Sparql query language. Perhaps what makes sense to me will make your discoveries come quicker to you.

In this blog, I will include links back to the technical specification I am discussing. You might want to read the whole blog first to get a general idea of what I'm talking about, and then revisit the links and dig in as deeply as you'd care to. Feel free to back out of a link if it's too technical or doesn't make sense to you. Relax. This will be fun!!!