doccano / doccano-client

A simple client for doccano API.

Home Page: https://doccano.github.io/doccano-client/


How to download data?

LighthouseInTheSea opened this issue

When I call the following code:
doc_download = doccano_client.get_doc_download(2, 'json')
print(doc_download.text)
the output is the HTML of the doccano web page rather than the exported data:
<!doctype html>

<title>doccano - doccano</title>
<style>#nuxt-loading{background:#fff;visibility:hidden;opacity:0;position:absolute;left:0;right:0;top:0;bottom:0;display:flex;justify-content:center;align-items:center;flex-direction:column;animation:nuxtLoadingIn 10s ease;-webkit-animation:nuxtLoadingIn 10s ease;animation-fill-mode:forwards;overflow:hidden}@keyframes nuxtLoadingIn{0%{visibility:hidden;opacity:0}20%{visibility:visible;opacity:0}100%{visibility:visible;opacity:1}}@-webkit-keyframes nuxtLoadingIn{0%{visibility:hidden;opacity:0}20%{visibility:visible;opacity:0}100%{visibility:visible;opacity:1}}#nuxt-loading>div,#nuxt-loading>div:after{border-radius:50%;width:5rem;height:5rem}#nuxt-loading>div{font-size:10px;position:relative;text-indent:-9999em;border:.5rem solid #f5f5f5;border-left:.5rem solid #fff;-webkit-transform:translateZ(0);-ms-transform:translateZ(0);transform:translateZ(0);-webkit-animation:nuxtLoading 1.1s infinite linear;animation:nuxtLoading 1.1s infinite linear}#nuxt-loading.error>div{border-left:.5rem solid #ff4500;animation-duration:5s}@-webkit-keyframes nuxtLoading{0%{-webkit-transform:rotate(0);transform:rotate(0)}100%{-webkit-transform:rotate(360deg);transform:rotate(360deg)}}@keyframes nuxtLoading{0%{-webkit-transform:rotate(0);transform:rotate(0)}100%{-webkit-transform:rotate(360deg);transform:rotate(360deg)}}</style><script>window.addEventListener("error",function(){var e=document.getElementById("nuxt-loading");e&&(e.className+=" error")})</script>
Loading...
<script>window.__NUXT__={config:{_app:{basePath:"/",assetsPath:"/static/_nuxt/",cdnURL:null}}}</script> <script src="/static/_nuxt/2b323e7.js"></script><script src="/static/_nuxt/9166e9e.js"></script><script src="/static/_nuxt/0f52d28.js"></script><script src="/static/_nuxt/c5544a6.js"></script>

How do I get the downloaded data?
  • Operating System: Windows
  • Python Version: 3.10.2
  • Package Version: doccano(1.5.5) doccano-client(1.0.3)

return self.get_file(
    "v1/projects/{project_id}/docs/download".format(project_id=project_id),
    params={"q": file_format, "onlyApproved": str(only_approved).lower()},
    headers=headers,
)

http://xxx.xx.xxx.xx:xxxx/v1/projects/13/download

Is the client requesting the wrong address?
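
The HTML in the original post suggests the request did not reach an export endpoint. As a rough sketch (not part of doccano-client), a small check like the one below could flag that situation before the body is treated as data. The function name `looks_like_html` is hypothetical, and the usage comment assumes a requests-style response object.

```python
def looks_like_html(content_type: str, body: bytes) -> bool:
    """Return True if a response appears to be an HTML page rather than export data."""
    if "text/html" in (content_type or "").lower():
        return True
    # Fall back to sniffing the first bytes of the body.
    head = body.lstrip()[:64].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# Hypothetical usage with a requests-style response object:
#   if looks_like_html(resp.headers.get("Content-Type", ""), resp.content):
#       raise RuntimeError("Got a web page instead of data; check the endpoint URL")
```

Failing loudly here is usually easier to debug than passing an HTML page on to a JSON parser.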

For those struggling with the same, I copied some code from another issue and added the zip file creation. It's all a bit obscure, so I'm reposting it here:

import time

def export_project(project_id, save_path):
    # doccano_client and doccano_client_url are assumed to be defined elsewhere.
    # Start the export task on the server.
    result = doccano_client.post(f'{doccano_client_url}v1/projects/{project_id}/download', json={'exportApproved': False, 'format': 'JSONL'})
    task_id = result['task_id']
    # Poll once per second until the task is ready.
    while True:
        result = doccano_client.get(f'{doccano_client_url}v1/tasks/status/{task_id}')
        if result['ready']:
            break
        time.sleep(1)
    # Download the finished export and stream it to disk as a zip file.
    result = doccano_client.get_file(f'{doccano_client_url}v1/projects/{project_id}/download?taskId={task_id}')
    with open(save_path, 'wb') as f:
        for chunk in result.iter_content(chunk_size=8192):
            f.write(chunk)
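
The polling loop above runs forever if the task never becomes ready. A minimal sketch of a bounded variant follows; `wait_until_ready` and its `timeout` and `interval` parameters are hypothetical names for illustration, not part of doccano-client.

```python
import time


def wait_until_ready(check_ready, timeout=60.0, interval=1.0):
    """Call check_ready() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if check_ready():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task not ready after {timeout} seconds")
        time.sleep(interval)


# Possible use with the names from the snippet above (assumed to exist):
#   wait_until_ready(
#       lambda: doccano_client.get(
#           f'{doccano_client_url}v1/tasks/status/{task_id}')['ready'])
```

Passing the check as a callable keeps the helper independent of the client, so it can be tried out without a running server.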

@PedroMTQ This is great, thanks.

I slightly modified the end of the function to avoid writing to disk. It returns the results as a list of JSON objects. I also use the baseurl attribute of the doccano_client object instead of the doccano_client_url variable.

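# Requires: import json; from io import BytesIO; from zipfile import ZipFile
...
result = doccano_client.get_file(f'{doccano_client.baseurl}v1/projects/{project_id}/download?taskId={task_id}')
# The export is a zip archive; open it in memory instead of saving it.
file_like_object = BytesIO(result.content)
zipfile_obj = ZipFile(file_like_object)
# Read the first file in the archive and parse each line as JSON.
data = zipfile_obj.open(zipfile_obj.namelist()[0]).read().splitlines()
data = [json.loads(line) for line in data]
return data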
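
To try the in-memory parsing without a doccano server, here is a self-contained sketch of the same idea. The helper name `read_jsonl_from_zip`, the member name `all.jsonl`, and the sample records are made up for illustration; a real export may be laid out differently.

```python
import json
from io import BytesIO
from zipfile import ZipFile


def read_jsonl_from_zip(zip_bytes: bytes) -> list:
    """Parse the first member of a zip archive as JSON Lines, entirely in memory."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        first_member = archive.namelist()[0]
        with archive.open(first_member) as member:
            lines = member.read().splitlines()
    # Skip blank lines so a trailing newline does not break json.loads.
    return [json.loads(line) for line in lines if line.strip()]


# Build a small archive in memory to stand in for the downloaded export.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr(
        "all.jsonl",  # hypothetical member name
        '{"id": 1, "text": "hello"}\n{"id": 2, "text": "world"}\n',
    )

records = read_jsonl_from_zip(buffer.getvalue())
print(records)
```

With the client from this thread, the bytes would come from `result.content` rather than from the stand-in buffer.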

Cheers!

Is there any way to download the documents and include the metadata?