fumitoh / modelx

Use Python like a spreadsheet!

Home Page: https://modelx.io


Feature to output input values to a log file

alebaran opened this issue

from modelx import *
import pandas as pd

m, s = new_model(), new_space('s')
series = pd.Series([1, 2], pd.Index([3, 4], name='a'))
s.new_cells_from_pandas(series, 'x')  # create cells x from the Series
s.x[3] = 10                           # overwrite one value by input
write_model(m, 'm')
m2 = read_model('m')
print(m.s.x[3], m2.s.x[3])            # the overwritten value is lost in m2

Cells created by the new_cells_from_xxx and new_space_from_xxx methods should be treated as read-only for now. There's no way to tell whether a value was read from the Series or input by you.

Isn't it possible to just store the content of the cell regardless of how it was created?

How about this?

from modelx import *
import pandas as pd

m, s = new_model(), new_space('s')
s.y = pd.Series([1, 2], pd.Index([3, 4], name='a'))  # store the Series as a Reference

@defcells
def x(a):
    return y[a]  # read from the Reference by default

s.x[3] = 10      # input values on formula cells survive write/read
write_model(m, 'm')
m2 = read_model('m')
print(m.s.x[3], m2.s.x[3])

Well, I'm loading all the data from pandas. I could probably write my own "from_pandas" function and in this way trick the saving engine into not applying the default "from_pandas" encoding approach.
Why can't the encoder just dump cell.frame instead of the source df?

In general I'm struggling to benefit from the ability to compare two versions of a saved model. Very frequently I use a dataframe stored in a cell without arguments. I can't create it using the "from_pandas" function, so I use the "new_cells" function directly and assign the value to it. In this case the encoder doesn't give the data file a unique name, so I can't compare two models.
I know that your idea is to store it as references, but moving there would be a significant change for me, as the name-based extraction mechanism differs (I'd need to differentiate between .cells[name] and .refs[name]).

The ideal situation, as I imagine it, is to have the modelx structure (the argument-value relationship) stored in a JSON file and non-plain values stored as a reference instead of a value. For pandas objects the reference could be checksum-driven (e.g. MD5).
The same approach would work well for overwrites of formula-based cells (the current approach of storing argument-value pairs for overwrites is cryptic).
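
Purely as a hypothetical illustration (this is not an existing modelx format, and the checksum string is one of the made-up ones used later in this thread), such a JSON file might look like:

{
    "s.x": {
        "(3,)": 10,
        "(4,)": {"ref": "md5:abfjfnsksk28"}
    }
}

Plain values are stored inline; non-plain values are replaced by a checksum-keyed reference into a separate data file, so a text diff of two saved models shows exactly which entries changed.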

All Reference values and Cells' input values are stored in the data/data.pickle file in the model folder,
except for modelx objects and values of type bool, int, float, str and module. This is for maintaining the consistency of the saved objects' identities.

For example,

import pandas as pd
import modelx as mx

df = pd.DataFrame()

space = mx.new_space()

@mx.defcells
def foo(): pass

@mx.defcells
def bar(): pass

foo[()] = df
bar[()] = [df]
space.baz = {1: [df]}

Then foo() is bar()[0] is space.baz[1][0] must evaluate to True after writing and reading back the model. So saving only df in a separate file while maintaining this consistency is impossible.
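
To see why, here is a minimal sketch (plain pickle, no modelx) showing that shared identity survives only within a single dump:

import pickle
import pandas as pd

df = pd.DataFrame()

# One dumps call memoizes the object, so shared identity survives...
a, b = pickle.loads(pickle.dumps((df, [df])))
assert a is b[0]

# ...but separate dumps calls yield two independent copies.
c = pickle.loads(pickle.dumps(df))
d = pickle.loads(pickle.dumps([df]))
assert c is not d[0]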

I am thinking of introducing a new type of Reference that represents data read from files.
Something like:

space.new_data(name="y", value=df, path="C:/data.file", encoder="pickle")

Then you can do something like:

foo.formula = lambda: y
bar.formula = lambda: [y]
space.baz = {1: [y]}

This way, foo() is bar()[0] is space.baz[1][0] still evaluates to True even after writing and reading back the model.

Why is this useful: foo() is bar()[0] is space.baz[1][0]?

So that changes in df are reflected in foo, bar and baz. Note that this discussion applies to all mutable types, not just DataFrame.

I don't think it is that useful to have such a change-tracking mechanism explicitly implemented around a data object. I'm pretty happy storing df in a cell (foo in your example) and referencing it in the formulas of the other cells (bar and baz in your example).
I don't see how you can track changes within a mutable object, and the way they affect modelx cells, without introducing a modelx wrapper around the object. And that ends up being the same as the cell wrapper.

I think you have an extremely strong and flexible framework around cells, which is to a large extent inspired by the Excel framework. I would spend time on fully leveraging it rather than introducing new constructs.
E.g. Excel is fully self-sufficient with just cells containing data and formulas. References are only used to link the cells in different worksheets and workbooks (no data items are stored in references).
Recent versions of Excel allow storing a dynamic array in a cell, referencing it in another cell as a whole, and applying functions to it.

Another argument here is the need to maintain consistency between data items and overwrites of formula-driven cells.
I built my model around operations with pandas objects: pandas objects are provided as inputs, created by formulas, and overwritten when necessary.

Without object identity consistency, the dfs in foo and bar would be two different objects after being read back. Having many copies of large DataFrames in memory would be a problem.

This is about appropriate usage of a high-level programming language. It would of course be a problem if multiple copies of the object were created with minor modifications for no reason. It would also be a problem in Excel if a vector of 10000 cells were copied in a formula 1000 times, adding one element each time.
I have built my model around operations with large objects. It was important to design it so that a modified object is only stored (as a modelx cell value) when it needs to become part of modelx's dependency graph. Otherwise, minor modifications like the ones in your example (bar and baz) can be introduced either directly in the formulas of the cells using bar and baz, or by creating Python functions bar and baz outside the modelx framework and calling them in the dependent cells with foo as an argument.

Have you experienced a practical issue due to the copies in memory?

Btw, modelx doesn't create copies of pandas objects unless it is specifically forced to. It was something I had to get used to: I explicitly use df.copy() in a dependent cell to avoid affecting the predecessor when modifying df. I carefully assess when I need a copy and when referencing is enough.

Let's say df is a large DataFrame. It's completely fine to assign df to 10000 cells. And you would expect that the 10000 cells point to the same df even after the model is saved and read back. The only way I can think of to achieve this automatically is to use pickle. But pickle cannot save objects in separate files. So you need some mechanism to manually mark the same object, which is what I meant by the idea above. The idea is to use a Reference as an entry point into the model for data read from files.

modelx Cells are always pass-by-reference, and cannot trace changes of mutable objects' contents.
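
A small illustration of this point, following the pattern of the examples above:

import pandas as pd
import modelx as mx

space = mx.new_space()

@mx.defcells
def foo(): pass

df = pd.DataFrame({"a": [1]})
foo[()] = df        # the cell stores a reference, not a copy
df["b"] = [2]       # mutate the object in place
print(foo() is df)  # True: same object, yet modelx cannot detect
                    # that its contents just changed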

Why can't df in your example be stored as a cell value, with the other cells referencing it?

You mean this code?

import pandas as pd
import modelx as mx

df = pd.DataFrame()

space = mx.new_space()

@mx.defcells
def foo(): pass

@mx.defcells
def bar(): pass

foo[()] = df
bar[()] = [df]
space.baz = {1: [df]}

How can bar and baz reference the same df as foo after being read back?

Won't this work:

import pandas as pd
import modelx as mx

df = pd.DataFrame()

space = mx.new_space()

@mx.defcells
def foo(): pass

@mx.defcells
def bar(): pass

foo[()] = df
bar[()] = [foo()]
space.baz = {1: [foo()]}

You can do it manually, but I cannot think of any way to make read_model do:

bar[()] = [foo()]
space.baz = {1: [foo()]}

Sorry, I wasn't precise. I meant this:

import pandas as pd
import modelx as mx

df = pd.DataFrame()

space = mx.new_space()

@mx.defcells
def foo(): pass

foo[()] = df

@mx.defcells
def bar(): return [foo()]

@mx.defcells
def baz(): return {1: [foo()]}

Whether to assign [df] to bar[()] as an input value or let bar.formula refer to foo() is up to the user. Both should be acceptable.

My revised idea is to add a file property to Cells.

import pandas as pd
import modelx as mx

df = pd.DataFrame()

space = mx.new_space()

@mx.defcells
def foo(): pass

@mx.defcells
def bar(): pass

foo[()] = df
bar[()] = [df]
space.baz = {1: [df]}

foo.file = "foodata"

Then the input value of foo is stored in a separate file named "foodata". Other values are still written to data/data.pickle. After the model is written and read back, bar()[0] is space.baz[1][0] holds, but foo() is bar()[0] does not.

I need to think about how to specify file for Dynamic Cells, and the serializer needs some work.

It looks complex and introduces inconsistency. It also doesn't allow tracking changes in specific data entries. My actual use case is as follows:

@mx.defcells
def foo(t): pass

foo[1] = df1
foo[2] = df2

It isn't essential for me whether df1 and df2 are stored in the main data.pickle file, in a single foo.pickle, or in separate pickled files. What is essential is being able to track changes in the data items. In the case above, the following could be saved in foo's file:

[[(1), "abfjfnsksk28"],
[(2), "dkckfndjxj38"]]

where abfjfnsksk28 and dkckfndjxj38 are checksums of the contents of the DataFrames.

This approach allows consistent treatment of overwrites of formula-based cells:

@mx.defcells
def foo(t): return foo(t-1) + 1

foo[1] = df1

If df1 and df2 are two different DataFrames but have the same values, then their MD5 can be the same, so there's no way to tell that they are different objects.
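
A minimal sketch of that ambiguity, using a hypothetical df_checksum helper (the name and the to_json-based hashing are assumptions, not modelx behavior):

import hashlib
import pandas as pd

def df_checksum(df: pd.DataFrame) -> str:
    # Hash the serialized contents, not the object identity.
    return hashlib.md5(df.to_json().encode("utf-8")).hexdigest()

df1 = pd.DataFrame({"a": [1, 2]})
df2 = df1.copy()  # a distinct object with identical values
print(df1 is df2)                            # False: two objects
print(df_checksum(df1) == df_checksum(df2))  # True: same checksum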

The current implementation stores the object ids of keys and values in Space1/data/foo:

(1253944192008, 1253942890888)       
(1253944191816, 1253938289864)

1253944192008 is id((1,))
1253942890888 is id(df1)
1253944191816 is id((2,))
1253938289864 is id(df2)

And data.pickle contains a pickle of a dictionary mapping the ids to their objects:

{
    1253944192008: (1,),
    1253942890888: df1,
    1253944191816: (2,),
    1253938289864: df2
}

The objects have different ids when the model is read back.
It may be possible to map object ids to UUIDs and use the UUIDs for saving. But that doesn't tell you whether the values in the dfs have changed unless you save the dfs in separate files.

I assume you are talking about the case where there is actually something different between the two dfs - not just that they are stored separately in memory - yet the MD5 of df.to_json() is the same.

I can think of a few options to implement this with stable keys and overcome the issue of duplicate hashes (a sketch follows the list):

  1. Use a combination of hash and ID when storing the object. It is important to keep the same ID if the model is opened, modified and saved.
    1a) Similar to the above, but only append the ID to the hash when a different object with the same hash is identified.
  2. Separate the hash representation from the actual decoding mechanics. For example, you can keep the current way of referencing data, but hide it in the pickled object while showing the hashed representation in the JSON file.
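
A rough sketch of option 1 (all names here are hypothetical, and a real implementation would have to persist the registry with the model):

import hashlib
import uuid

class StableKeyRegistry:
    """Mint one stable key per object: contents hash plus a unique suffix."""

    def __init__(self):
        self._keys = {}  # id(obj) -> stable key

    def key_for(self, obj, payload: bytes) -> str:
        oid = id(obj)
        if oid not in self._keys:  # mint once, reuse on every later save
            digest = hashlib.md5(payload).hexdigest()
            self._keys[oid] = f"{digest}-{uuid.uuid4().hex[:8]}"
        return self._keys[oid]

Within one session a re-save returns the same key, and two different objects with identical contents still get distinct keys thanks to the suffix.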

Btw, why don't you store objects in the pickled dictionary using the args tuple as the key?

It took me long enough to figure out why you want to store data in a separate file per variable. I guess you do it to allow running the model with a different file.
I gave the above a bit more thought, and I think it makes sense to separate encoding/decoding from change tracking for complex objects. Combining it with the file-interchangeability concept, this is how it might look (a sketch follows the list):

  • For each cell with data, create a JSON/CSV file. Store information in the file as key-value pairs, where the key is (args) and the value is the value for that set of arguments.
  • In case (args) or the value isn't plain, show it as str(args) or str(value) in the JSON file (I'm not sure this is necessary for (args)), and create a variable.pickle file containing a dictionary of the complex objects.
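
A minimal sketch of that layout (the helper name and file naming are illustrative only):

import json
import pickle

def save_cell_data(name, data, folder="."):
    # data: mapping of args tuples to values for one cells object
    plain = (bool, int, float, str, type(None))
    visible, complex_objs = {}, {}
    for args, value in data.items():
        key = str(args)
        if isinstance(value, plain):
            visible[key] = value       # plain values stay readable
        else:
            visible[key] = str(value)  # human-readable stand-in
            complex_objs[key] = value  # the real object goes to pickle
    with open(f"{folder}/{name}.json", "w") as f:
        json.dump(visible, f, indent=2)
    if complex_objs:
        with open(f"{folder}/{name}.pickle", "wb") as f:
            pickle.dump(complex_objs, f)

The str() stand-ins keep the JSON diffable in a text editor, while the pickle carries the actual objects.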

The limitations of the above solution are:

  1. the saved file isn't the same as the source file in the case of a from_csv call
  2. for complex objects it isn't possible to modify the data without loading the model

I think neither is an issue when from_df or a code overwrite is the data source, as those aren't file-based and require Python code for changes anyway.

Limitation 1 can be resolved by designing the CSV above to be consistent with the from_csv call.

How about adding an option to output a single log file that contains the MD5 of every input value:

Space1.foo(t=1)  4504e8c67981532542aa33a1be3ebac5
Space1.foo(t=2)  8acf3e2acf05b6026bc77cfd1a9588fe

Isn't this enough for your purpose? I want it disabled by default as it slows down writing.
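
For what it's worth, the hashing side could be as simple as this sketch (write_input_log and its inputs argument are hypothetical; collecting the labelled input values from a model is left to the caller):

import hashlib
import pickle

def write_input_log(path, inputs):
    # inputs: mapping like {"Space1.foo(t=1)": value, ...}
    with open(path, "w") as f:
        for label, value in sorted(inputs.items()):
            digest = hashlib.md5(pickle.dumps(value)).hexdigest()
            f.write(f"{label}  {digest}\n")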

This is a good alternative :-) What do you think of str(value) instead of MD5? It would be more informative and still make it easy to trace changes in a text editor.

Will you fix the overwrites saving issue?

I want to fix it but it may take a while to get to it. Can you not use new_cells_from_pandas if you want to overwrite values?

It works, thank you

One more thing, as we are talking about model saving: have you thought about zip-archiving the folder? It would save space and make it more convenient to send the model around. We save the model on the OneDrive cloud, and it ends up syncing a hundred files instead of one.
Using a zip archive would be the same as what Excel does with its files.
I can also implement this myself.