UDST / spandex

Spatial Analysis and Data Extraction

Home Page:http://nbviewer.ipython.org/github/synthicity/user_meeting_2014/blob/gh-pages/spandex/spandex_demo.ipynb

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

clarify the role of conventions so that users know whether spandex requires them to use exec_sql to edit databases in all cases

tbuckl opened this issue · comments

Its unclear to me as a user whether I can edit a database that Spandex is using/managing outside of the Spandex ORM without breaking the Spandex database class' understanding of the schema.

In short, the Spandex database class does not seem to show any existing tables on my public schema after dropping and then re-adding a table in psql and then inspecting the database with Spandex database class.

The long version:
This is a pretty simple and use-case-specific SQL query that takes a few minutes:

https://github.com/synthicity/spandex/blob/master/spandex/spatialtoolz.py#L495-L513

It works fine when you run this script from start to finish:

https://github.com/synthicity/bayarea_urbansim/blob/master/data_regeneration/run.py

Unfortunately, running that script from start to finish takes more than 12 hours on a well-provisioned (and tuned) machine.

As a user, it would be nice to be able to just call that specific SQL query on an arbitrary table. However, it seems that there may be some conventions or dependencies that I am not following in calling it.

In particular, I suspect that I am getting an error because I am not calling that function on a table that was specifically created or registered with one of the several ORM's (2 if you count Spandex as an ORM?) that seem to be in use in this repository.

The error is below. As a user, this means I will probably re-write the query from the ORM language into SQL in order to accomplish my larger goal of reducing the run-time of data regenerations.

---geom_aggregation1 took 7.37857508659 seconds ---
/home/vagrant/anaconda/lib/python2.7/site-packages/sqlalchemy/dialects/postgresql/base.py:2079: SAWarning: Did not recognize type 'bpchar' of column 'county_id'
  name, format_type, default, notnull, domains, enums, schema)
/home/vagrant/anaconda/lib/python2.7/site-packages/sqlalchemy/dialects/postgresql/base.py:2079: SAWarning: Did not recognize type 'unknown' of column 'imputation_flag'
  name, format_type, default, notnull, domains, enums, schema)
PARCEL AGGREGATION:  Merge geometries (and aggregate attributes) based on within-interior-ring status
Traceback (most recent call last):
  File "geom_aggregation2.py", line 16, in <module>
    df = geom_unfilled(t.public.parcels, 'unfilled')
AttributeError: type object 'public' has no attribute 'parcels'
Traceback (most recent call last):
  File "geom_test.py", line 30, in <module>
    check_run('geom_aggregation2.py')
  File "geom_test.py", line 22, in check_run
    return subprocess.check_call([python, path])
  File "/home/vagrant/anaconda/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/vagrant/anaconda/bin/python', 'geom_aggregation2.py']' returned non-zero exit status 1

Hi Tom. Assuming spandex understands the data types (to map them to Python types), you can edit the database outside of spandex. If you changed table schema while running spandex, you need run database.refresh().

I see two errors in the traceback:

  • Did not recognize type 'bpchar' of column 'county_id'
  • Did not recognize type 'unknown' of column 'imputation_flag'

bpchar (blank padded char) is an internal PostgreSQL type. Can you check on the data type of those columns? You could try explicitly casting to int or char as appropriate.

It might be helpful to uncomment the logger.warn call in spandex.database.

FYI -

spandex recreates its knowledge of the schema each time the database.refresh() method is called. That knowledge is only kept in memory and doesn't persist if you restart Python/spandex.

When spandex modifies table schema (for example to add a new column), it calls database.refresh(). database.refesh() iterates over every schema, table, column using SQLAlchemy/GeoAlchemy introspection/reflection.

For more information about reflection, see the SQLAlchemy documentation on Reflecting Database Objects and Using Reflection with Declarative.

Thanks @daradib! I appreciate the feedback and I'll try this out when I get a moment.

By uncommenting that logger.warn I was able to discover that the tables that were not showing up in the load.tables/database class were not being mapped because they did not have a primary key. Is this a GeoAlchemy convention (or requirement) for reflection or a Spandex one? SQLAlchemy does not require this. The QGIS database adapter also has this requirement/convention. Perhaps its necessary? However, it clearly confused me. Anyway, thanks @daradib!