How to download data?
LighthouseInTheSea opened this issue
When I call the following code:
    doc_download = doccano_client.get_doc_download(2, 'json')
    print(doc_download.text)
it prints an HTML page instead of the exported data:
    <!doctype html>
How do I get the downloaded data?
- Operating System: Windows
- Python Version: 3.10.2
- Package Version: doccano (1.5.5), doccano-client (1.0.3)
    return self.get_file(
        "v1/projects/{project_id}/docs/download".format(project_id=project_id),
        params={"q": file_format, "onlyApproved": str(only_approved).lower()},
        headers=headers,
    )
http://xxx.xx.xxx.xx:xxxx/v1/projects/13/download
Is the client requesting the wrong address?
For those struggling with the same problem, I copied some code from another issue and added the zip file creation. It's all a bit obscure, so I'm reposting it here:
    import time
    def export_project(project_id, save_path):
        # Start the export task on the server.
        result = doccano_client.post(f'{doccano_client_url}v1/projects/{project_id}/download', json={'exportApproved': False, 'format': 'JSONL'})
        task_id = result['task_id']
        # Poll the task status once per second until the export is ready.
        while True:
            result = doccano_client.get(f'{doccano_client_url}v1/tasks/status/{task_id}')
            if result['ready']:
                break
            time.sleep(1)
        # Download the finished export (a zip archive) and stream it to disk.
        result = doccano_client.get_file(f'{doccano_client_url}v1/projects/{project_id}/download?taskId={task_id}')
        with open(save_path, 'wb') as f:
            for chunk in result.iter_content(chunk_size=8192):
                f.write(chunk)
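The polling loop above waits forever if the export task fails or hangs. A minimal sketch of the same wait with a timeout follows. It assumes only that the status response is a dict with a `ready` key, as in the snippet above. The names `wait_until_ready`, `fetch_status`, `timeout`, and `interval` are hypothetical and not part of doccano-client.

```python
import time


def wait_until_ready(fetch_status, timeout=60.0, interval=1.0):
    """Poll fetch_status() until it reports ready, or raise TimeoutError.

    fetch_status is any zero-argument callable returning a dict with a
    'ready' key (hypothetical helper, not part of doccano-client).
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("ready"):
            return status
        # Give up once the deadline has passed instead of looping forever.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task not ready after {timeout} seconds")
        time.sleep(interval)
```

Inside `export_project` it could stand in for the `while True` loop, for example `wait_until_ready(lambda: doccano_client.get(f'{doccano_client_url}v1/tasks/status/{task_id}'))`.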
I slightly modified the end of the function to avoid writing to disk. It now returns the results as a list of JSON objects. I also use the baseurl from the doccano_client object instead of the doccano_client_url variable.
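    # Requires: import json; from io import BytesIO; from zipfile import ZipFile
    ...
    result = doccano_client.get_file(f'{doccano_client.baseurl}v1/projects/{project_id}/download?taskId={task_id}')
    # Open the downloaded zip archive in memory instead of saving it.
    file_like_object = BytesIO(result.content)
    zipfile_obj = ZipFile(file_like_object)
    # Read the first file in the archive and parse one JSON object per line.
    data = zipfile_obj.open(zipfile_obj.namelist()[0]).read().splitlines()
    data = [json.loads(line) for line in data]
    return data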
Cheers!
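The in-memory unzip step from the snippet above can be tried without a doccano server. This is a rough sketch, assuming the first file in the archive holds one JSON object per line. `parse_jsonl_zip` is a hypothetical helper name, not part of doccano-client.

```python
import json
from io import BytesIO
from zipfile import ZipFile


def parse_jsonl_zip(content):
    """Return the records from the first file in a zip archive of JSONL.

    content is the raw bytes of the archive (hypothetical helper; it
    assumes the first archive member holds one JSON object per line).
    """
    with ZipFile(BytesIO(content)) as archive:
        first_name = archive.namelist()[0]
        with archive.open(first_name) as member:
            lines = member.read().splitlines()
    # Skip blank lines so a trailing newline does not break json.loads.
    return [json.loads(line) for line in lines if line.strip()]
```

With the response from `get_file` this would be called as `parse_jsonl_zip(result.content)`.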
Is there any way to download the documents and include metadata?
fixed #59