opencivicdata / scrapers-ca

Canadian legislative scrapers

Toronto: Get subjects for agenda items

patcon opened this issue

The "City Subject Thesaurus" page provides a taxonomy for 311 and apparently TMMIS:

http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=5cdbe49bf2f35310VgnVCM1000003dd60f89RCRD&vgnextchannel=0186e03bb8d1e310VgnVCM10000071d60f89RCRD

It seems that internally, agenda items are classified, but I'm unsure how to get that information out. The only way I can get items on a specific topic is by advanced search limited to "topic keywords", and then searching based on a quoted string that I know to be a keyword (from the thesaurus).

http://app.toronto.ca/tmmis/findAgendaItem.do?function=doPrepare

Perhaps I could do this search for the top x levels of the taxonomy, but this strikes me as kinda messy. I'll reach out later to figure out if there's a better way.

Also, on a positive note, it seems that at least some degree of subject tagging happens before the session date, so this will be useful for upcoming agenda items as well:

http://app.toronto.ca/tmmis/findAgendaItem.do?function=doSearch&fromDate=2016-01-07&toDate=2017-01-01&word=%22permits%22&includeKeywords=on
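For reference, a minimal sketch of building that search URL from a thesaurus term. The parameter names are copied from the URL above; the helper name `keyword_search_url` is made up for illustration, and I'm assuming the endpoint accepts exactly these parameters and nothing more is required.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are taken from the example URL above.
SEARCH_URL = "http://app.toronto.ca/tmmis/findAgendaItem.do"


def keyword_search_url(term, from_date, to_date):
    """Build an advanced-search URL limited to topic keywords (hypothetical helper)."""
    params = {
        "function": "doSearch",
        "fromDate": from_date,
        "toDate": to_date,
        # The term is wrapped in double quotes so it is searched as a quoted string.
        "word": f'"{term}"',
        "includeKeywords": "on",
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


url = keyword_search_url("permits", "2016-01-07", "2017-01-01")
print(url)
```

With those inputs this reproduces the URL above, so the same helper could be looped over any list of thesaurus terms.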

I'll have to reach out to find out whether additional tagging happens after the meeting.

I did some digging around in the subject thesaurus xml file, using the xpath command:

xpath -q -e '/THESAURUS/CONCEPT/*[self::DESCRIPTOR or self::NT]' subject_thesaurus.xml
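A rough Python equivalent of that query, using only the standard library. The element names (THESAURUS, CONCEPT, DESCRIPTOR, NT) come from the xpath expression above; the inline sample document is invented for illustration and is much simpler than the real file, and I'm assuming NT holds a concept's narrower (child) terms.

```python
import xml.etree.ElementTree as ET

# Invented sample; the real subject_thesaurus.xml is assumed to share this shape.
SAMPLE = """<THESAURUS>
  <CONCEPT>
    <DESCRIPTOR>abandoned vehicles</DESCRIPTOR>
    <NT>abandoned bicycles</NT>
  </CONCEPT>
  <CONCEPT>
    <DESCRIPTOR>abandoned bicycles</DESCRIPTOR>
  </CONCEPT>
</THESAURUS>"""


def descriptors_and_narrower(root):
    """Yield (descriptor, [narrower terms]) for each CONCEPT under the root."""
    for concept in root.findall("CONCEPT"):
        descriptor = concept.findtext("DESCRIPTOR", default="").strip()
        narrower = [nt.text.strip() for nt in concept.findall("NT") if nt.text]
        yield descriptor, narrower


root = ET.fromstring(SAMPLE)
pairs = list(descriptors_and_narrower(root))
print(pairs)
```

For the real file, `ET.parse("subject_thesaurus.xml").getroot()` would replace the `ET.fromstring(SAMPLE)` call.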

Seems that DESCRIPTOR tags are the only thing that agenda items are tagged with internally. Also, there are several types of descriptors:

  • top-level meta-category
    • Examples:
      • CULTURE (SC)
      • EDUCATION (SC)
  • meta-subcategory
    • Examples:
      • [people in culture]
      • [activities in culture]
  • term
    • Examples:
      • faith communities
      • performers
      • poet laureates
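A quick sketch of telling the three kinds apart by label alone. This is a guess based purely on the examples above: I'm assuming top-level categories always end in "(SC)" and subcategories are always wrapped in square brackets, which may not hold across the whole thesaurus. The function name is made up.

```python
def descriptor_kind(label):
    """Classify a DESCRIPTOR label by its surface form (heuristic, assumed)."""
    label = label.strip()
    if label.endswith("(SC)"):
        # e.g. "CULTURE (SC)"
        return "top-level"
    if label.startswith("[") and label.endswith("]"):
        # e.g. "[people in culture]"
        return "subcategory"
    # Everything else is treated as a plain term, e.g. "poet laureates".
    return "term"


for label in ["CULTURE (SC)", "[people in culture]", "poet laureates"]:
    print(label, "->", descriptor_kind(label))
```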

Agenda items themselves are only ever tagged with terms. As far as I can tell, pseudonyms turn up results for their associated "official" terms, but child terms don't show up for parent terms. For example, "abandoned bicycles" is a child of "abandoned vehicles", but searching the latter returns different results from the former.

So my thinking is that it would be great to map the taxonomy, and use the search to find out all the tagged topics for agenda items. Then we could liberally apply parent topics. So something tagged with "poet laureates" would also get a tag for "people in culture" and "culture".
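Something like the following could do the parent tagging once the taxonomy is mapped. It's only a sketch: the `PARENT` dict is a hand-written stand-in for the mapped taxonomy, and it assumes each topic has at most one parent, which the real thesaurus may not guarantee.

```python
# Hand-written stand-in for the mapped taxonomy: child label -> parent label.
PARENT = {
    "poet laureates": "[people in culture]",
    "[people in culture]": "CULTURE (SC)",
}


def with_ancestors(tags, parent):
    """Return the given tags plus every ancestor reachable through `parent`."""
    expanded = set()
    for tag in tags:
        current = tag
        # Stop at the root, or at a topic already handled (also guards against cycles).
        while current is not None and current not in expanded:
            expanded.add(current)
            current = parent.get(current)
    return expanded


print(sorted(with_ancestors({"poet laureates"}, PARENT)))
```

The labels could be normalized afterwards (dropping the brackets and the "(SC)" suffix) to get the plain "people in culture" and "culture" tags.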

This should give us a good basis for tagging AgendaItems/Bills with topics/subjects (making it a little easier to filter upcoming agenda items by interest in our specific app).

And I'll probably ignore AgendaItem subjects and attach subjects only to Bill objects.

Ugh. 2357 terms. And we need to do a search for each. OK, so if we want to make this useful for upcoming agenda items, we'll probably need to do a full search, then find out which terms are most popular, and only search those for this bills-incremental.py scraper, or whatever we call it.
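Picking the popular terms after a full pass could look roughly like this. The item identifiers and tags below are made up, and `most_common_terms` is a hypothetical helper; it assumes the full search has already been collected into a dict of item id to list of terms.

```python
from collections import Counter


def most_common_terms(item_tags, n):
    """Return the n terms attached to the most agenda items (hypothetical helper)."""
    counts = Counter(term for tags in item_tags.values() for term in tags)
    return [term for term, _ in counts.most_common(n)]


# Made-up results of a full search: item id -> terms that matched it.
item_tags = {
    "item-1": ["permits", "dogs"],
    "item-2": ["permits"],
    "item-3": ["permits", "dogs", "parks"],
}

top_terms = most_common_terms(item_tags, 2)
print(top_terms)
```

The incremental scraper would then only run searches for the terms in `top_terms` rather than all 2357.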

Got a reply from Matthew at Clerk's office:

We don't have the data in a form that can be released as a dataset. We'll take the suggestion under advisement for future development.

So yep, seems there's no way except advanced search to get this out of the system atm.

Just sent this:

Hey Matthew,

Any recommendations on operators that can be used during search? When I use "[" or "]", I see an error mentioning "OR" and "AND" usage, but not much else.

Particularly interested in searching "dogs" for topic under advanced search, and not getting agenda items returned that are tagged "guide dogs" or "hotdogs" :)

Thanks,
Patrick

Bumped this email thread with Matthew today, as it's become relevant again.