dahlia / wikidata

Wikidata client library for Python

Home Page:https://pypi.org/project/Wikidata/

Would you mind a rewrite?

BMaxV opened this issue · comments

commented

Hey I wanted to use wikidata previously and wrote my own little script. Would you accept a pull request if I polished that up a bit?

I don't really understand why you use all those different modules and programming techniques, like caches or why you need to create a client to do a http request for you. If you could give me a list of features you want, I can try to reach feature parity.

I don't really understand why you use all those different modules and programming techniques, like caches or why you need to create a client to do a http request for you.

Caches are useful when only a few entities are queried frequently within a short period of time. Caching is turned off by default anyway, and how cached values are stored can be customized (and tested) through the CachePolicy interface.
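To make the idea of a pluggable cache policy more concrete, here is a minimal sketch of a small in-memory policy that evicts the least recently used entry. The class name LruCachePolicy and its get/set methods are assumptions made for illustration; they are not taken from the library's actual CachePolicy interface.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LruCachePolicy:
    """Hypothetical cache policy: keeps at most max_size recent entries."""

    def __init__(self, max_size: int = 128) -> None:
        self.max_size = max_size
        self._store: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value, or None on a cache miss."""
        if key not in self._store:
            return None
        # Mark the entry as recently used.
        self._store.move_to_end(key)
        return self._store[key]

    def set(self, key: Hashable, value: Any) -> None:
        """Store a value and evict the oldest entries beyond max_size."""
        self._store[key] = value
        self._store.move_to_end(key)
        while len(self._store) > self.max_size:
            self._store.popitem(last=False)
```

Because the policy is a plain object with no networking in it, it can be unit-tested on its own, which is the point of keeping storage behind an interface.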

Client also has several roles. First of all, it improves testability: you don't need any mocks or monkey patches for unit testing. It also lets you customize the HTTP networking policy without any global state. For example, you can configure HTTP proxies and timeout settings for requests to Wikidata while leaving the global HTTP settings at their defaults.
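The testability argument can be illustrated with a rough sketch of dependency injection. TinyClient, get_raw and fake_fetch below are hypothetical names invented for this example and are not the wikidata.client.Client API; the point is only that when the networking function is passed in, a test can swap it for a fake without mocks or monkey patches.

```python
import json
from typing import Any, Callable, Dict


class TinyClient:
    """Hypothetical client: the HTTP layer is injected, not global."""

    def __init__(self, fetch: Callable[[str], bytes],
                 base_url: str = 'https://www.wikidata.org/') -> None:
        self.fetch = fetch
        self.base_url = base_url

    def get_raw(self, entity_id: str) -> Dict[str, Any]:
        """Fetch and decode the JSON document of one entity."""
        url = f'{self.base_url}wiki/Special:EntityData/{entity_id}.json'
        document = json.loads(self.fetch(url))
        return document['entities'][entity_id]


def fake_fetch(url: str) -> bytes:
    """Test double: returns canned JSON instead of touching the network."""
    payload = {'entities': {'Q42': {
        'id': 'Q42',
        'labels': {'en': {'language': 'en', 'value': 'Douglas Adams'}},
    }}}
    return json.dumps(payload).encode('utf-8')


client = TinyClient(fake_fetch)
label = client.get_raw('Q42')['labels']['en']['value']
```

In production code the same constructor could receive a function that applies proxy and timeout settings, so those settings stay local to this client rather than being installed globally.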

Would you accept a pull request if I polished that up a bit?

Feel free to send a pull request, but we may not accept it if it's far from the project's philosophy. The project is licensed under GPLv3 or later, which means you're free to fork it for your own use. 😄

commented

Ok caching and testing makes sense...

we may not accept it if it's far from project's philosophy.

Of course. When I learned about wikidata in general and the python module, I expected something that would resemble the structure as you can see it on wikidata: an Entity that has properties or claims that you can iterate over. And when you have their ids, you should be able to request them as you did with the original one.

This is what I wrote after I couldn't make your module do what I wanted. I'd rather integrate that as far as it makes sense into your package and not create another one with a less obvious name than "wikidata". :)

So the question would be what you want the module to do? Besides caching and testing?

Maybe it would make sense to split it up into a networking package and a wikidata entity package, leaving the method of requests to the user and just provide the code to unpack and pythonify the wikidata json?

Also, as far as I can tell, all iter functions and lists throw type errors, because there are unsupported types in your collection(?):

from wikidata.client import Client
c=Client()
e=c.get("Q42")
for e_i in e.iterlists():
    print(e_i)

returns a collection datavalue error.

I expected something that would resemble the structure as you can see it on wikidata: an Entity that has properties or claims that you can iterate over.

An Entity instance implements the mapping protocol (i.e., it is a dict-like object): keys are properties and values are the entities associated through those properties. As with an ordinary dict, you can iterate over the keys (i.e., properties) of an Entity object:

>>> from wikidata.client import Client
>>> c = Client()
>>> e = c.get('Q42', load=True)
>>> e
<wikidata.entity.Entity Q42 'Douglas Adams'>
>>> props = list(e)  # Iterate keys of e
>>> props
[<wikidata.entity.Entity P31>, <wikidata.entity.Entity P21>, ..., <wikidata.entity.Entity P2949>]
>>> p31 = props[0]
>>> p31.label
m'instance of'

And when you have their ids, you should be able to request them as you did with the original one.

As I mentioned above, an Entity instance is dict-like: the keys are properties and the values are the entities associated through those properties. So you don't have to request them yourself; you can simply get them with the index operator (again, analogous to a dict):

>>> e[p31]
<wikidata.entity.Entity Q5>
>>> e[p31].label
m'human'
>>> p21 = props[1]
>>> p21.label
m'sex or gender'
>>> e[p21]
<wikidata.entity.Entity Q6581097>
>>> e[p21].label
m'male'
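For readers unfamiliar with the mapping protocol, the following is a minimal sketch of what "dict-like" means in Python. TinyEntity is a hypothetical stand-in written for this example, not the library's Entity class, and it uses plain ID strings where the real library uses Entity objects. Implementing __getitem__, __iter__ and __len__ on top of collections.abc.Mapping is enough to get key iteration, indexing, `in`, `.get()` and `.items()`.

```python
from collections.abc import Mapping
from typing import Dict, Iterator


class TinyEntity(Mapping):
    """Hypothetical dict-like entity: property IDs map to value IDs."""

    def __init__(self, entity_id: str, claims: Dict[str, str]) -> None:
        self.id = entity_id
        self._claims = dict(claims)

    def __getitem__(self, prop: str) -> str:
        return self._claims[prop]

    def __iter__(self) -> Iterator[str]:
        # Iterating over the entity yields its properties (the keys).
        return iter(self._claims)

    def __len__(self) -> int:
        return len(self._claims)


douglas = TinyEntity('Q42', {'P31': 'Q5', 'P21': 'Q6581097'})
props = list(douglas)   # keys only, like list(some_dict)
human = douglas['P31']  # index operator, like some_dict[key]
```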

This is what I wrote after I couldn't make your module do what I wanted. I'd rather integrate that as far as it makes sense into your package and not create another one with a less obvious name than "wikidata". :)

Even though I may not fully understand the intent of your script, you can do the same thing with the wikidata library:

import random
from wikidata.client import Client

def test():
    client = Client()
    doug = client.get('Q42', load=True)
    print(doug)

    # list(doug) yields property Entity objects, so pass the .id to client.get()
    random_prop = random.choice(list(doug))
    random_claim = client.get(random_prop.id, load=True)
    print(random_claim)

    me = client.get('P2534', load=True)
    print(me)

if __name__ == '__main__':
    test()

So the question would be what you want the module to do? Besides caching and testing?

I believe what I wrote above answers this. I think it's not that different from what you want. 😄

Maybe it would make sense to split it up into a networking package and a wikidata entity package, leaving the method of requests to the user and just provide the code to unpack and pythonify the wikidata json?

That might be better, but I bet that if I split out a networking package, it wouldn't be general enough to be useful for anything other than a Wikidata client library. Instead, I split the wikidata.client module from the wikidata.entity module, and we're fairly satisfied with this approach. 🤔

Also, as far as I can tell, all iter functions and lists throw type errors, because there are unsupported types in your collection(?):

from wikidata.client import Client
c=Client()
e=c.get("Q42")
for e_i in e.iterlists():
   print(e_i)

returns a collection datavalue error.

It's basically because the project is still immature. The wikidata.datavalue module is not complete, and we need to implement more of the widely used data types in it. One workaround is to iterate over keys rather than items (i.e., (key, value) pairs): DatavalueError won't be raised until you try to get that specific value.
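The workaround above can be sketched as a small generic helper that walks the keys and skips any value that fails to decode. safe_items and PartlyDecodable are hypothetical names written for this example; the stand-in mapping raises ValueError where, with the real library, one would presumably pass an Entity and catch DatavalueError instead.

```python
from typing import Any, Iterator, Mapping, Tuple


def safe_items(mapping: Mapping, errors=(Exception,)) -> Iterator[Tuple[Any, Any]]:
    """Yield (key, value) pairs, skipping keys whose value lookup fails."""
    for key in mapping:  # iterating keys never decodes a value
        try:
            value = mapping[key]  # decoding happens only on access
        except errors:
            continue  # unsupported value: skip it and carry on
        yield key, value


class PartlyDecodable(dict):
    """Stand-in mapping where one key holds an undecodable value."""

    def __getitem__(self, key):
        if key == 'P625':  # pretend this data type is unsupported
            raise ValueError('unsupported datavalue type')
        return super().__getitem__(key)


claims = PartlyDecodable(P31='Q5', P625='<coordinates>', P21='Q6581097')
decoded = list(safe_items(claims, ValueError))
```

Passing the exception type explicitly keeps the helper from silently swallowing unrelated errors such as network failures.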

Overall, the troubles you've experienced with this package seem to be due to a lack of docs, a tutorial and manual in particular. I currently don't have much spare time to improve the project, but I'd like to write more comprehensive docs for users of the package.