philsv / pycot

Python library for retrieving the CFTC Commitment of Traders reports

Home Page: https://pypi.org/project/pycot-reports/


cache historical data

dermatzeimnetz opened this issue · comments

Please save downloaded zips in a data folder if they are older than the current year.
Maybe add a config option for where to save them, or "false" if you don't want to save at all. If nothing is set, default to "save to data folder".
It's not about the bandwidth, but the time improvement.

Example:

  • extract 20 different legacy futures one by one, without redownloading the entire file
  • while developing, shorten the time to wait to fetch the data, when you want to test your script again

The package already uses caching via @lru_cache. On the second load you should see a noticeable time improvement. Downloaded files are currently loaded into your tmp directory via tempfile. No zip files are saved in the temp directory, as they are passed as an object directly into extract_text_file_to_dataframe(). I would not recommend loading the data into the data folder; I moved away from that approach because the tempfile approach is much cleaner IMO.
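The in-memory approach described above can be sketched roughly like this (a simplified stand-in, not pycot's actual internals): the downloaded zip bytes are wrapped in a BytesIO and parsed straight into a DataFrame, so no zip file ever touches the filesystem.

```python
import io
import zipfile

import pandas as pd


def dataframe_from_zip_bytes(raw: bytes) -> pd.DataFrame:
    """Parse the first file inside downloaded zip bytes directly into a DataFrame."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:  # zip stays in memory, never on disk
        with zf.open(zf.namelist()[0]) as fh:
            return pd.read_csv(fh)
```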

extract 20 different legacy futures one by one, without redownloading the entire file

I think this should already be covered by @lru_cache — or could you explain what you are experiencing when using, for example, legacy_report?

while developing, shorten the time to wait to fetch the data, when you want to test your script again

It should not fetch the data again because of @lru_cache.

Have you tried something like this?

import pandas as pd


def get_cot_data(cot_report_data: pd.DataFrame, contracts: list):
    df = cot_report_data[cot_report_data["Contract Name"].isin(contracts)]
    ...  # your own implementation
    return df

contracts_1 = [
    "BBG COMMODITY - CHICAGO BOARD OF TRADE",
    "BLOOMBERG COMMODITY INDEX - CHICAGO BOARD OF TRADE",
]

contracts_2 = [
    "FED FUNDS - CHICAGO BOARD OF TRADE",
    "30-DAY FEDERAL FUNDS - CHICAGO BOARD OF TRADE",
]

df_1 = get_cot_data(legacy_reports(), contracts_1)  # first call: loads the full report, so it takes longer
df_2 = get_cot_data(legacy_reports(), contracts_2)  # should load faster, because legacy_reports() is cached via @lru_cache

@lru_cache only returns a cached result when the arguments are exactly the same.
Calling the function three times with three different markets is never a cache hit; each call is computed from scratch.

import time

from pycot.reports import legacy_report

start = time.perf_counter()
df = legacy_report("legacy_fut", "CORN - CHICAGO BOARD OF TRADE")
print(f"\n\tExecution time: {round(time.perf_counter() - start, 3)} sec")

start = time.perf_counter()
df = legacy_report("legacy_fut", "SOYBEANS - CHICAGO BOARD OF TRADE")
print(f"\n\tExecution time: {round(time.perf_counter() - start, 3)} sec")

start = time.perf_counter()
df = legacy_report("legacy_fut", "COCOA - CHICAGO BOARD OF TRADE")
print(f"\n\tExecution time: {round(time.perf_counter() - start, 3)} sec")

Execution time: 10.733 sec
Execution time: 10.182 sec
Execution time: 10.193 sec

https://docs.python.org/3/library/functools.html: "The cache keeps references to the arguments"
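This is standard functools.lru_cache behavior and can be confirmed without pycot at all — `report_for` below is a stand-in for any expensive loader, with a list recording every uncached computation:

```python
from functools import lru_cache

calls = []  # records every real (uncached) computation


@lru_cache(maxsize=None)
def report_for(market: str) -> str:
    calls.append(market)  # stand-in for the slow download/parse step
    return f"data for {market}"


report_for("CORN - CHICAGO BOARD OF TRADE")
report_for("SOYBEANS - CHICAGO BOARD OF TRADE")
report_for("CORN - CHICAGO BOARD OF TRADE")  # identical args: served from cache

print(len(calls))  # → 2: only the two distinct markets triggered real work
```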

I have added a convenience function for this in version 0.0.7.

Try this:

from pycot.reports import cot_report, legacy_report

df = cot_report(legacy_report("legacy_fut"), "CORN - CHICAGO BOARD OF TRADE")  # will load the full report (~ 10-15 seconds)
df = cot_report(legacy_report("legacy_fut"), "SOYBEANS - CHICAGO BOARD OF TRADE")  # cached results
df = cot_report(legacy_report("legacy_fut"), "COCOA - CHICAGO BOARD OF TRADE")  # cached results
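A rough sketch of why this pattern is fast (hypothetical stand-ins, not pycot's actual internals): the expensive load is cached once per report type, and each subsequent call only applies a cheap filter to the cached frame.

```python
from functools import lru_cache

import pandas as pd

loads = []  # records every real (uncached) load


@lru_cache(maxsize=None)
def load_full_report(report_type: str) -> pd.DataFrame:
    loads.append(report_type)  # stand-in for the slow download/parse step
    return pd.DataFrame({
        "Contract Name": ["CORN - CBOT", "SOYBEANS - CBOT"],
        "Open Interest": [100, 200],
    })


def filter_report(df: pd.DataFrame, contract: str) -> pd.DataFrame:
    # cheap per-market step: a boolean filter on the already-loaded frame
    return df[df["Contract Name"] == contract]


corn = filter_report(load_full_report("legacy_fut"), "CORN - CBOT")
soy = filter_report(load_full_report("legacy_fut"), "SOYBEANS - CBOT")
# load_full_report ran only once; the second call was a cache hit
```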

Perfect, thank you!