Toronto: Get subjects for agenda items
patcon opened this issue
The "City Subject Thesaurus" page provides a taxonomy for 311 and apparently TMMIS:
It seems that internally, agenda items are classified, but I'm unsure how to get that information out. The only way I can get items on a specific topic is by advanced search limited to "topic keywords", and then searching based on a quoted string that I know to be a keyword (from the thesaurus).
http://app.toronto.ca/tmmis/findAgendaItem.do?function=doPrepare
Perhaps I could do this search for the top x levels of the taxonomy, but this strikes me as kinda messy. I'll reach out later to figure out if there's a better way.
Also, on a positive note, it seems that at least some degree of subject tagging happens before the session date, so this will be useful for upcoming agenda items as well:
I'll have to reach out to find out whether additional tagging happens after the meeting.
I did some digging around in the subject thesaurus XML file, using the xpath command:
xpath -q -e '/THESAURUS/CONCEPT/*[self::DESCRIPTOR or self::NT]' subject_thesaurus.xml
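For anyone without the xpath CLI, roughly the same extraction can be sketched in Python with the standard library. The inline sample is a made-up fragment: only the THESAURUS, CONCEPT, DESCRIPTOR and NT element names come from the command above, and the real file may carry more structure than this assumes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the element names are taken from the xpath above.
SAMPLE = """
<THESAURUS>
  <CONCEPT>
    <DESCRIPTOR>abandoned vehicles</DESCRIPTOR>
    <NT>abandoned bicycles</NT>
  </CONCEPT>
  <CONCEPT>
    <DESCRIPTOR>poet laureates</DESCRIPTOR>
  </CONCEPT>
</THESAURUS>
"""

def descriptors_and_narrower(root):
    """Yield (tag, text) for each DESCRIPTOR or NT directly under a CONCEPT."""
    for concept in root.findall("CONCEPT"):
        for child in concept:
            if child.tag in ("DESCRIPTOR", "NT"):
                yield child.tag, (child.text or "").strip()

root = ET.fromstring(SAMPLE)
pairs = list(descriptors_and_narrower(root))
for tag, text in pairs:
    print(tag, text)
```

To run it against the real file, swap the `ET.fromstring(SAMPLE)` line for `ET.parse("subject_thesaurus.xml").getroot()`.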
Seems that DESCRIPTOR tags are the only thing that agenda items are tagged with internally. Also, there are several types of descriptors:
- top-level meta-category
  - Examples:
    CULTURE (SC)
    EDUCATION (SC)
- meta-subcategory
  - Examples:
    [people in culture]
    [activities in culture]
- term
  - Examples:
    faith communities
    performers
    poet laureates
Agenda items themselves are only ever tagged with terms. As far as I can tell, pseudonyms turn up results for their associated "official" terms, but child terms don't show up for parent terms. For example, "abandoned bicycles" is a child of "abandoned vehicles", but searching the latter returns different results from the former.
So my thinking is that it would be great to map the taxonomy, and use the search to find out all the tagged topics for agenda items. Then we could liberally apply parent topics. So something tagged with "poet laureates" would also get a tag for "people in culture" and "culture".
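A minimal sketch of the "liberally apply parent topics" idea, assuming each descriptor has at most one parent. The `PARENTS` map and the `with_ancestors` helper are hypothetical names for illustration; in practice the map would be built from the thesaurus rather than written by hand.

```python
# Hypothetical child -> parent map; the real one would come from the thesaurus.
PARENTS = {
    "poet laureates": "[people in culture]",
    "[people in culture]": "CULTURE (SC)",
}

def with_ancestors(tags, parents=PARENTS):
    """Return the given tags plus every ancestor reachable through `parents`.

    Order is preserved and duplicates are dropped; the `seen` set also
    guards against accidental cycles in the map.
    """
    expanded = []
    seen = set()
    for tag in tags:
        current = tag
        while current is not None and current not in seen:
            seen.add(current)
            expanded.append(current)
            current = parents.get(current)
    return expanded

print(with_ancestors(["poet laureates"]))
```

If a descriptor turns out to have several parents, the map values would need to become lists and the walk a small graph traversal.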
This should give us a good basis for tagging AgendaItems/Bills with topics/subjects (making it a little easier to filter upcoming agenda items by interest in our specific app)
And I'll probably ignore AgendaItem subjects and just attach subjects only to Bill objects
Ugh. 2357 terms. And we need to do a search for each. OK, so if we want to make this useful for upcoming agenda items, we'll probably need to do a full search, find out which terms are most popular, and then search only those in this bills-incremental.py scraper, or whatever we call it.
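Picking the popular terms could look something like this, assuming the full search leaves us with a mapping from term to the agenda items it returned. The `top_terms` helper and the sample data are made up for illustration.

```python
from collections import Counter

def top_terms(items_by_term, n):
    """Rank terms by how many agenda items each returned; keep the n most popular."""
    counts = Counter({term: len(items) for term, items in items_by_term.items()})
    return [term for term, _ in counts.most_common(n)]

# Hypothetical full-search results: term -> agenda item identifiers.
full_search = {
    "housing": ["A1", "A2", "A3"],
    "poet laureates": ["A4"],
    "abandoned vehicles": ["A5", "A6"],
}

print(top_terms(full_search, 2))
```

The incremental scraper would then only loop over that shortlist instead of all 2357 terms.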
Just realized I should be using arrays rather than dicts, but here's a simple parser for the taxonomy that we can use to tag:
https://gist.github.com/patcon/62533121ac27d3c7873b
uri for incremental updates: http://app.toronto.ca/tmmis/findAgendaItem.do?function=doSearch&includeStaffRec=false&includeKeywords=on&includeDecision=false&includeTitle=false&includeSummary=false&fromDate=2016-01-24&toDate=2016-02-24&word=housing
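A small sketch of building that URI for any term and date window. The parameter names and values are copied from the URI above; the `keyword_search_url` helper is a hypothetical name, and whether the server accepts `+` for spaces in multi-word terms is an assumption I haven't checked.

```python
from urllib.parse import urlencode

BASE = "http://app.toronto.ca/tmmis/findAgendaItem.do"

def keyword_search_url(word, from_date, to_date):
    """Build a topic-keyword-only search URL for the given term and date range."""
    params = [
        ("function", "doSearch"),
        ("includeStaffRec", "false"),
        ("includeKeywords", "on"),  # search topic keywords only
        ("includeDecision", "false"),
        ("includeTitle", "false"),
        ("includeSummary", "false"),
        ("fromDate", from_date),  # dates formatted as YYYY-MM-DD, as in the URI above
        ("toDate", to_date),
        ("word", word),
    ]
    return BASE + "?" + urlencode(params)

url = keyword_search_url("housing", "2016-01-24", "2016-02-24")
print(url)
```

For a quoted keyword, passing `'"poet laureates"'` as `word` gets the quotes percent-encoded by `urlencode`.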
Got a reply from Matthew at Clerk's office:
We don't have the data in a form that can be released as a dataset. We'll take the suggestion under advisement for future development.
So yep, seems there's no way except advanced search to get this out of the system atm.
Just sent this:
Hey Matthew,
Any recommendations on operators that can be used during search? When I use "[" or "]", I see an error mentioning "OR" and "AND" usage, but I'm not sure what else is supported.
Particularly interested in searching "dogs" for topic under advanced search, and not getting agenda items returned that are tagged "guide dogs" or "hotdogs" :)
Thanks,
Patrick
Bumped this email thread with Matthew today, as it's become relevant again.