Getting the Pleiades Dataset in a TDB Triple Store

Another excellent gazetteer for which I have found use is the Pleiades gazetteer of the ancient world. Its native format is RDF, so one needs a triple store rather than a standard relational database to query it. I’m a huge fan of Apache’s open source projects, so I used Apache Jena and TDB.

I got the dataset in Turtle format from the compressed tar archive of the latest data, as described at http://pleiades.stoa.org/downloads. Having uncompressed it, I attempted to load it into an empty local directory with the TDB command line tool tdbloader2:


tdbloader2 --loc Documents/TDBPleiades/ ~/Downloads/pleiades-latest/*

The loc parameter is where you would like your triple store to go. It needs to be an empty directory.

However, using the latest data from Feb 26, 2014, I got the following error on the places-5.ttl file:


ERROR [line: 48513, col: 84] Triples not terminated by DOT
org.apache.jena.riot.RiotException: [line: 48513, col: 84] Triples not terminated by DOT

This was easily rectified by deleting a spurious ‘>’ character at the end of the indicated line.

With the triples in the store, I could then query at the command line with a command such as


tdbquery --query /Users/jcreel/Development/SPARQL-work/query --loc=/Users/jcreel/Documents/TDBPleiades/

with a query file such as


SELECT ?x WHERE { ?x <http://pleiades.stoa.org/places/vocab#hasName> <http://pleiades.stoa.org/places/579885/athens>}

and get results of the form


--------------------------------------------
| x                                        |
============================================
| <http://pleiades.stoa.org/places/579885> |
--------------------------------------------

Getting page content by title from a MediaWiki database

It always takes a little bit of time and cognitive effort to reconstitute a query with a few joins or get the identifiers of some obscure resources. For this reason, I like to use this blog to keep track of non-trivial data and techniques that I find myself having to use frequently but not every day.

Here’s another example I’m putting up for convenience: getting from a MediaWiki page title to its current content (in a Postgres database – note that the table names are slightly different in MySQL). There are three tables at play, with names that may seem slightly peculiar to one unfamiliar with the schema.

For the name ‘Houston’:


SELECT page_title, old_text
FROM page
JOIN revision ON page.page_id = revision.rev_page
             AND revision.rev_id = page.page_latest -- keep only the current revision
JOIN pagecontent ON revision.rev_id = pagecontent.old_id
WHERE page_title = 'Houston';
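
One more refinement may be useful, depending on your wiki: the same title can exist in several namespaces (Talk:, User:, and so on), so filtering on page_namespace = 0 keeps just the main-namespace article, and strictly speaking the link from a revision to its text row is rev_text_id rather than rev_id (the two often coincide, but are not guaranteed to). A sketch under those assumptions:


SELECT page_title, old_text
FROM page
JOIN revision ON revision.rev_id = page.page_latest           -- current revision only
JOIN pagecontent ON pagecontent.old_id = revision.rev_text_id -- canonical revision-to-text link
WHERE page_namespace = 0                                      -- main namespace only
  AND page_title = 'Houston';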

Utilizing the GeoNames Dataset in Postgres

So, you have a use case that calls for you to repeatedly get lots of data from the fabulous (and big) dataset of the GeoNames gazetteer. It’s best in such a case to get the database set up locally, as the web service will limit the queries you can make. How best can one get started? The web service has a robust API, but it is non-trivial to recapitulate basic functionality even after shoehorning the data into your local database. My use case didn’t need to cover a lot of the functions in the API, but maybe you will find it helpful.

The resources on the World Wide Web and the very nice GeoNames Forum go a long way toward getting your local database populated with the GeoNames data. In this post, I’ll provide an account of my own experiences downloading, ingesting, and querying the data.

I’m no DB administrator, but I tend to favor Postgres over MySQL for scalability reasons (although I’ve heard lately that MySQL has improved by using InnoDB rather than MyISAM by default – do feel free to correct me in the comments). The best resource at the time of this writing on the GeoNames Forum for Postgres imports is the post entitled Import GeoNames dump into Postgres.

You’ll need to get the data dumps for the content you want. Find the files you need at the GeoNames dump. I use several of these table dumps for my use cases:


admin1CodesASCII.txt
admin2Codes.txt
allCountries.txt
alternateNames.txt
countryInfo.txt
hierarchy.zip

Note that hierarchy.zip unzips to hierarchy.txt.

In addition, to get the continents into a table, you need another text file that is not provided in the dump, but was very simple for me to come up with:

geonames-continents.txt


AF  6255146
AS  6255147
EU  6255148
NA  6255149
SA  6255150
OC  6255151
AN  6255152

On the GeoNames forum page about this, the very helpful commentator marc notes that, given a suitable SQL file of CREATE TABLE statements, one can schematize (i.e., create the tables for) one’s Postgres database with the command

psql database_name < create.sql

and then copy in the data from your dump downloads with a command of the form

copy <tablename> (<column1name, column2name, ... columnNname>) from '<dumpfilename>' null as '';

Well, how’s that for inscrutable? Let me clarify: the tablename might be, for example, “geoname”; the columns would then be “geonameid,name,asciiname,alternatenames,latitude,longitude”, etc.; and the dumpfile is merely what one downloads from the GeoNames dump as noted above.

marc provides an example create.sql file at the top of that forum page. It seems that as new data have been added to GeoNames, some of the field values were too big to fit in their allotted number of varchars, and some new columns were added, so I had to make a few amendments to the create.sql. Here’s the version that worked for me:

create.sql


create table geoname (
         geonameid       int,
         name            varchar(200),
         asciiname        varchar(200),
         alternatenames  varchar(10000),
         latitude        float,
         longitude       float,
         fclass  char(1),
         fcode   varchar(10),
         country varchar(2),
         cc2 varchar(60),
         admin1  varchar(20),
         admin2  varchar(80),
         admin3  varchar(20),
         admin4  varchar(20),
         population      bigint,
         elevation       int,
         gtopo30         int,
         timezone varchar(40),
         moddate         date
 );
 
 create table alternatename (
         alternatenameId  int,
         geonameid          int,
         isoLanguage        varchar(7),
         alternateName     varchar(200),
         isPreferredName      boolean,
         isShortName    boolean,
         isColloquial    boolean,
         isHistoric    boolean
 );

 CREATE TABLE "countryinfo" (
      iso_alpha2 char(2),
      iso_alpha3 char(3),
      iso_numeric integer,
      fips_code character varying(3),
      name character varying(200),
      capital character varying(200),
      areainsqkm double precision,
      population integer,
      continent char(2),
      tld char(3),
      currencyCode char(3),
      currencyName varchar(40),
      phone varchar(20),
      postalCodeFormat varchar(60),
      postalCodeRegex varchar(200),
      languages character varying(200),
      geonameId int,
      neighbours varchar(60),
      equivalentFipsCode character varying(3)
 );

CREATE TABLE "hierarchy" (
  parentId int,
  childId int,
  type varchar(40) );

CREATE TABLE "admin1codes" (
  admin1 varchar(20),
  name varchar(200),
  asciiname varchar(200),
  geonameid int );

CREATE TABLE "admin2codes" (
  admin2 varchar(80),
  name varchar(200),
  asciiname varchar(200),
  geonameid int );

CREATE TABLE "continentinfo" (
  geonameid int,
  continent varchar(40) );

Some of the dumps may require some massaging before the copy command will work, however. I needed to delete the comments at the top of countryInfo.txt, for example.

For your convenience, I provide the copy commands that worked for me. Note that the text files should be in your present working directory.


\copy geoname (geonameid,name,asciiname,alternatenames,latitude,longitude,fclass,fcode,country,cc2,admin1,admin2,admin3,admin4,population,elevation,gtopo30,timezone,moddate) from allCountries.txt with null as ''

\copy alternatename  (alternatenameid,geonameid,isoLanguage,alternateName,isPreferredName,isShortName,isColloquial,isHistoric) from alternateNames.txt null as ''

\copy countryInfo  (iso_alpha2,iso_alpha3,iso_numeric,fips_code,name,capital,areainsqkm,population,continent,tld,currencyCode,currencyName,phone,postalCodeFormat,postalCodeRegex,languages,geonameId,neighbours,equivalentFipsCode) from countryInfo.txt null as ''

\copy hierarchy (parentId,childId,type) from hierarchy.txt null as ''

\copy admin1codes (admin1, name, asciiname, geonameid) from admin1CodesASCII.txt null as ''

\copy admin2codes (admin2, name, asciiname, geonameid) from admin2Codes.txt null as ''

\copy continentinfo (continent, geonameid) from geonames-continents.txt null as ''
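
Once the copies finish, a quick sanity check is to compare row counts against the line counts of the dump files (the exact numbers will depend on when you downloaded the dumps):


-- Row counts after import; compare with the line counts of the corresponding dump files.
SELECT 'geoname' AS tbl, count(*) FROM geoname
UNION ALL SELECT 'alternatename', count(*) FROM alternatename
UNION ALL SELECT 'countryinfo', count(*) FROM countryinfo
UNION ALL SELECT 'hierarchy', count(*) FROM hierarchy;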

Getting the data into the database is relatively straightforward compared with getting it out in informative ways. My use case required only a query by name and a query to return the hierarchy. Both of these presented interesting idiosyncrasies that I’ve endeavored to communicate on the GeoNames forums. There’s a thread where folks share the queries they have developed for use on their local databases: Collection of geonames SQL queries.

A query to get the rows of places with a particular name (Spain), or having that name as an alternate name according to the alternatename table:


SELECT DISTINCT geoname.geonameid, name, fcode, latitude, longitude FROM geoname JOIN alternatename ON geoname.geonameid = alternatename.geonameid WHERE alternatename = 'Spain'
UNION SELECT DISTINCT   geonameid, name, fcode, latitude, longitude FROM geoname WHERE name = 'Spain';
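
A name like ‘Spain’ matches many kinds of features, so it is often handy to narrow the results by feature class and code – for example, keeping only the independent political entity (feature class ‘A’, feature code ‘PCLI’) and ranking by population:


-- Restrict name matches to country-level features and rank by population.
SELECT geonameid, name, fcode, country, population, latitude, longitude
FROM geoname
WHERE name = 'Spain'
  AND fclass = 'A'
  AND fcode = 'PCLI'
ORDER BY population DESC;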

And a query to navigate up the hierarchy according to the hierarchy table to get the places that contain your place of interest (here, GeoName id 6429727):


WITH RECURSIVE geonametree AS

(SELECT geoname.geonameid AS id, geoname.name AS name, hierarchy.parentid AS parent, hierarchy.childid AS child
FROM hierarchy
  JOIN geoname ON geoname.geonameid = hierarchy.childid 
  WHERE geoname.geonameid = 6429727

UNION ALL

SELECT geoname.geonameid AS id, geoname.name AS name, hi.parentid AS parent, hi.childid AS child
FROM hierarchy AS hi
  JOIN geoname ON geoname.geonameid = hi.parentid 
  INNER JOIN geonametree AS gt ON (hi.childid = gt.parent)
)
SELECT id, name, parent, child
FROM geonametree;
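
The recursive query walks up the tree; going the other direction is simpler, since a single join on hierarchy.parentid lists the direct children of a place (again using GeoName id 6429727 as the starting point):


-- Direct children of a place, one step down the hierarchy.
SELECT g.geonameid, g.name, g.fcode
FROM hierarchy h
JOIN geoname g ON g.geonameid = h.childid
WHERE h.parentid = 6429727;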

Last of all, let me tell you that any queries you might choose to run on those data will be intolerably slow without some indexes. Here are the ones I used to keep things running quickly:


create index altnameindex on alternatename (alternatename);
create index altnameidindex on alternatename (alternatenameid);
create index geonameidindex on geoname (geonameid);
create index geonamecountryindex on geoname (country);
create index countryindex on countryinfo (iso_alpha2);
create index hpindex on hierarchy (parentid);
create index hcindex on hierarchy (childid);
create index a1index on admin1codes (admin1);
create index a2index on admin2codes (admin2);
create index geonamename on geoname (name);
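
If a query still feels slow, EXPLAIN is the quickest way to confirm that Postgres is actually using one of these indexes rather than scanning a whole table:


-- Should report an index (or bitmap) scan on altnameindex rather than a sequential scan.
EXPLAIN SELECT geonameid FROM alternatename WHERE alternatename = 'Spain';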

As always, if my esteemed reader should find any errors or want to make any suggestions about the procedure I’ve outlined, I welcome input in the comments.

The DSpace resource_type_id

I do declare, my fellow DSpace developers, it is handy indeed to have the resource_type_id of those various DSpace objects at one’s fingertips. One can, of course, infer such data with some excessively sophisticated queries to a DSpace database, but you and I both know that a search engine is the first go-to. I have repeatedly had to make queries to DSpace involving the resource_type_id. The entities thereby represented are of specific interest to our users and customers. Folks have very personal concerns regarding DSpace items, collections, communities, etc. Specifically, folks often need reports on the handles meeting various criteria. As a public service, I offer the current IDs of the DSpace resources at the database level. These IDs occur in some crucial tables, including, as of this writing, RESOURCEPOLICY and HANDLE. The schema for DSpace 1.8 may be seen at DSpace 1.8 Storage Layer. Without further ado, I disclose the data:

bitstream: 0
bundle: 1
item: 2
collection: 3
community: 4
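
As a quick illustration of how these IDs come into play, here is an example against the HANDLE table of a DSpace 1.8-era database (in practice you will likely only see rows for items, collections, and communities, since those are the objects that receive handles):


-- How many handles exist for each kind of DSpace object?
-- (resource_type_id: 0 = bitstream, 1 = bundle, 2 = item, 3 = collection, 4 = community)
SELECT resource_type_id, count(*) AS handle_count
FROM handle
GROUP BY resource_type_id
ORDER BY resource_type_id;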

Digital Collections in the Texas A&M University Institutional Repository

Hello, fellow netizen! The interwebs are awesome! Check out this new digital log on the World Wide Web wherein the author endeavors to find mutual benefit and edification with those who seek knowledge and open access to information.

In this spirit, let me inaugurate the blog by presenting some prominent collections in the Texas A&M University (TAMU) digital repository, a DSpace repository housing a wealth of information. Although I am not a librarian, it has been my distinct privilege to perform a multitude of curation and crosswalking tasks during my service to the libraries since 2007 and to have imported the majority of the repository’s digital content. I can safely say that among the list of collections are a number of hidden gems that have thus far gotten little attention from search engines.

Enjoy!

Vascular Plant Image Gallery

This collection involved extensive collaboration with the Botany department to map their spreadsheets to the repository’s metadata schema. It was a real pleasure working with the botanists to crosswalk diverse information like genera, etc. into Dublin Core suitable for DSpace.

Retrospective Theses and Dissertations (1922-2004)

This set is a magnificent and historic resource in addition to being quite large (16334 items). It’s a product of extensive collaboration with TAMU Libraries’ Consortia Systems on converting catalog MARC to DC and MODS, and with the Texas Digital Library (TDL) and TAMU Library administrators on the best possible adherence to TDL ETD metadata standards. A great many of these theses are restricted to on-campus access at Texas A&M for copyright reasons, but authors who would like to set their thesis free for the edification of the public can navigate their browser to the work and click the “Request Open Access” button.

Faculty Publications

Submissions are open to all faculty, but most of the initial content consists of journal articles published by the American Physical Society (http://journals.aps.org/), for which our faculty colleague Jo Ann Bianci was able to obtain publication rights. Jo Ann prepared metadata for these, which we processed and imported.

Energy Systems Laboratory

Our work with this outstanding engineering institute is one of our oldest collaborations; in fact, the custom script the ESL folks use was developed prior to my joining the Library, although for some years now your author has done the importations, occasionally having to deal with character encoding issues or file naming issues when they get a new student worker. Several conferences are represented in collections under this repository community dealing with increasingly important topics like energy efficiency.

Texas AgriLife Extension Service – Current Extension Publications Archive

A product of collaboration with the AgriLife bookstore. This collection facilitates access to some of their publications that are now available for free. I took their SQL database of metadata (which was a backend to the bookstore website) and converted it (with the supplied pdfs) to DSpace SAF for importation.

Texas Agricultural Experiment Station Bulletin

The bulletin contains scientific research and reports for the benefit of the public going back to 1888. It’s one of my favorite collections, with well-researched information on everything from organic farming to canoeing. The collection is the product of a long-running and highly successful collaboration with Dr. Rob McGeachin. He uses a custom Java program to convert metadata spreadsheets to DSpace SAF. Student workers scan the content documents.

Cushing Images Collections

These have been the result of our processing of spreadsheets of metadata that the archivists have painstakingly compiled:

  • Wheelan Collection of Mexican Revolution Photographs
  • Toward a Better Living: African American Farming Communities in Mid-Century Texas
  • The Raiford L. Stripling Collection
  • World War I Postcards
  • Cushing Historical Images Collection

Historic Maps of Cuba

I processed metadata provided in a spreadsheet by the Maps Library (much of which came originally from the catalog).

Texas A&M University Press Consortium

We met extensively with the head of the Texas A&M University Press, Charles Backus (and his staff), about converting their printing-format documents to pdfs for the repository and providing additional promotion for the Press’s output. So far, we have had the opportunity to import only a small subset of their publications.

Finally, this post would not be complete without mention of some vital collections that have found their way into the repository by means other than your author’s importations.

Geologic Atlas of the United States

An early collaboration with the Maps Library that yielded a novel, outstanding Yahoo! Maps-based interface.

Electronic Theses, Dissertations, and Records of Study (2002-present)

This is where modern theses and dissertations end up instead of in the stacks. This content originates in the Vireo ETD management system. In past years, it was imported with a script, but now importation occurs cleanly and automatically via SWORD.