usc-isi-i2 / Web-Karma

Information Integration Tool

Home Page: http://www.isi.edu/integration/karma/

Questions: comparing with data from other rows and combining multiple CSVs into one

MartinSandberg opened this issue

Hi,

I'm trying to evaluate whether Karma would be suitable for my organisation.

It looks very promising when all you need to do is handle each row separately.
I managed to do most of the transformations we need in PyTransform (concatenating, comparing dates and much more). However, we need to do a few more things.

  1. Is it possible in PyTransform to get data from other rows of a CSV?
    For example, we have a column "userid" and a column "managerid". From the current row I get the "managerid", and I want to read the "managerid" in the row whose "userid" matches the current row's "managerid".

  2. Is it possible to find and exclude rows that are not unique (for instance, the same userid on multiple rows), or to exclude rows based on other criteria, such as an empty field?

  3. We need to combine multiple CSVs with different transformations into one single output CSV. I'm guessing this is done at a later stage (modelling?).

Hi Martin, I'm a Karma user and I might be able to answer some of your questions.

  1. I don't think you can do this with rows, but you can definitely do it with columns. You can use the getValue() function in a PyTransform to refer to any column by name, such as getValue('userid') or getValue('managerid'), and manipulate the values however you like. So you might want to transpose your data before importing it into Karma so that your row IDs become column IDs.

  2. I am not sure about excluding duplicate rows, but you can definitely exclude rows with empty fields using the Selection feature. If you can't exclude duplicates with the Selection feature, you might consider preprocessing the data to remove duplicates before you import it. I have started doing all my cleaning and transformation preprocessing in a Jupyter notebook with the Python Pandas library, because the Jython implementation that Karma uses for PyTransform is rather old and has been end-of-lifed; its final release was in 2015. I wanted to use some Python 3 libraries for certain more complex transformation tasks, and the Jupyter+Pandas solution has been really great. It ultimately makes it easier to create the Karma models because I'm importing really clean data in exactly the columns I need. I especially recommend it to anyone who is already comfortable using Python. PyTransforms are best suited to simple transforms and to people without programming experience.

  3. I combine my output data via SPARQL queries. I have had some issues using the query interface in the OpenRDF workbench in Karma; it too appears to be unmaintained, so I export my integrated data as N-Triples or N-Quads and import it into a different triplestore. I like Blazegraph because it's dead simple to get up and running -- just execute a .jar file -- but I've also heard good things about GraphDB Free.
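
As a rough sketch of the Pandas preprocessing mentioned in point 2, something like the following could handle both the duplicate/empty-field filtering and the cross-row lookup from question 1 before the data ever reaches Karma. The column names "userid" and "managerid" come from the question; the sample values and the names `unique_users`, `with_manager`, `lookup` and `manager_of_manager` are invented for illustration, and dropping every copy of a duplicated userid (rather than keeping one) is just one possible policy.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
raw_csv = io.StringIO(
    "userid,managerid\n"
    "u1,u2\n"
    "u2,u3\n"
    "u3,\n"
    "u4,u2\n"
    "u4,u3\n"
)

# Read every column as text; keep_default_na=False leaves empty cells as "".
df = pd.read_csv(raw_csv, dtype=str, keep_default_na=False)

# Question 2: exclude all rows whose userid appears on more than one row.
unique_users = df[~df.duplicated(subset="userid", keep=False)]

# Question 2: exclude rows where the managerid field is empty.
with_manager = unique_users[unique_users["managerid"] != ""]

# Question 1: for each row, read the "managerid" of the row whose
# "userid" equals the current row's "managerid".
lookup = unique_users.set_index("userid")["managerid"]
result = with_manager.assign(
    manager_of_manager=with_manager["managerid"].map(lookup)
)

# CSV text that could be saved to a file and then imported into Karma.
cleaned_csv = result.to_csv(index=False)
print(cleaned_csv)
```

The lookup is built from the de-duplicated rows, so each userid maps to exactly one managerid; if duplicates were kept, the mapping would be ambiguous and `map` would raise an error.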

Good luck!