boramalper / magnetico

Autonomous (self-hosted) BitTorrent DHT search engine suite.

Home Page: http://labs.boramalper.org/magnetico/

Community Database Dump

anindyamaiti opened this issue

I started fresh last month and have ~1.7M torrents in my database after about 3 weeks. I plan to keep my magneticod running for the foreseeable future, expecting to add roughly 50K torrents per day once the initial spike levels off.

There have been requests for a database dump before, but to my knowledge no one has shared theirs. So I thought I'd take the initiative. Here is my website, where I will share (via .torrent) my database dump 1-2 times a month: https://tnt.maiti.info/dhtd/

You can use it as-is or to get a head start with magneticod. And, don't forget to seed!

Huge thanks to @boramalper for making this project happen.

I will soon create a page under https://kescher.at/magneticod-db for sharing my own SQLite database backups.

Thanks @kescherCode. I suggest you take the same approach of sharing via .torrent.

As far as GitHub is concerned, I am confident that sharing an external webpage that contains .torrent files of self-created databases is not in violation of any rules. Neither the webpage posted here nor the torrent/database contains any copyrighted material.

That's good news.

Is anyone brave enough to write a small script to merge two databases?

That's good news.

Is anyone brave enough to write a small script to merge two databases?

https://github.com/kd8bny/sqlMerge

Alternatively:
https://stackoverflow.com/a/37138506/11519866

I have put up a page at https://kescher.at/magneticod-db now :)

That's good news.
Is anyone brave enough to write a small script to merge two databases?

https://github.com/kd8bny/sqlMerge

Alternatively:
https://stackoverflow.com/a/37138506/11519866

There is a real issue here.
First, the sqlMerge repository doesn't work even after merging kd8bny/sqlMerge#3 and kd8bny/sqlMerge#4, since it relies on str(row), which doesn't work on BLOB columns.

Then, I explored manual merging. Obviously, torrents.id needs to be modified, but the issue is that there is a foreign key from files.torrent_id to torrents.id, so the script should:

  • Read a record from merged_db
  • Insert it into original_db and record the insertion id
  • Modify merged_db.files.torrent_id according to the new id, and merge the merged_db.files rows.

This is not so difficult, but I don't think it can be done using a generic script.
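
A rough sketch of what I have in mind, assuming magneticod's usual SQLite schema (torrents(id, info_hash, name, total_size, discovered_on) with a UNIQUE infohash, and files(torrent_id, size, path)); torrents whose infohash already exists in the destination are simply skipped, adjust the column lists if your schema differs:

```python
import sqlite3

# merged_db is merged *into* original_db, remapping files.torrent_id as it goes.
dst = sqlite3.connect("original_db.sqlite3")
src = sqlite3.connect("merged_db.sqlite3")

for old_id, info_hash, name, total_size, discovered_on in src.execute(
    "SELECT id, info_hash, name, total_size, discovered_on FROM torrents"
):
    # Skip torrents already present in the destination (same infohash).
    if dst.execute("SELECT 1 FROM torrents WHERE info_hash = ?", (info_hash,)).fetchone():
        continue
    cur = dst.execute(
        "INSERT INTO torrents (info_hash, name, total_size, discovered_on) VALUES (?, ?, ?, ?)",
        (info_hash, name, total_size, discovered_on),
    )
    new_id = cur.lastrowid  # id assigned in the destination database
    # Re-attach the file rows to the new torrent id.
    dst.executemany(
        "INSERT INTO files (torrent_id, size, path) VALUES (?, ?, ?)",
        [
            (new_id, size, path)
            for size, path in src.execute(
                "SELECT size, path FROM files WHERE torrent_id = ?", (old_id,)
            )
        ],
    )

dst.commit()
src.close()
dst.close()
```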

It may be ideal to compress the database before sharing.

Compressing with LZMA2 (7zip) on the 'fastest' preset (64K dictionary, 32 word size) yields a file 23.3% of the size of the uncompressed database.

Compressing on the 'normal' preset (16M dictionary, 32 word size) yields a file 15-18.2% of the size of the uncompressed database.

I'd suggest using the following command on Linux systems with xz-utils:

xz -7vk --threads=12 database.sqlite3

-7 is the compression level (out of 9)
-v is for 'verbose'; it shows the progress of the compression in your TTY
-k is for 'keep', meaning it won't delete your database after the compression is done
--threads=12 tells xz to use 12 threads; you can use --threads=0 to use as many threads as there are CPUs on your server.

This will produce a file named database.sqlite3.xz

It can then be decompressed after being downloaded using unxz -v database.sqlite3.xz (or -vk if you want to keep the archive).
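
For completeness, the same compression can also be scripted with Python's standard lzma module (preset 7 here mirrors xz -7); this is only a sketch, the xz command above does the same job:

```python
import lzma
import shutil

# Compress database.sqlite3 to database.sqlite3.xz, roughly equivalent to `xz -7k`.
with open("database.sqlite3", "rb") as src, \
        lzma.open("database.sqlite3.xz", "wb", preset=7) as dst:
    shutil.copyfileobj(src, dst)
```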

I've taken the liberty of compressing both of the shared databases so far: https://public.tool.nz/DHT/

@AlphaDelta I agree, and I will compress my future torrents containing a database.sqlite3 with xz, however probably with the most aggressive preset you've ever seen:

xz --lzma2=dict=1536Mi,nice=273

@kescherCode That is indeed the most aggressive preset I've ever seen 😂

Just keep in mind the entire dictionary is loaded into memory when decompressing, so it would require allocating 1536MiB of memory just to decompress the database.

Probably not worth riding the exponential-cost train that far.

That's a very significant size reduction using xz!

I will share both compressed and uncompressed versions from next time. Those who have a low-memory VPS for hosting magneticow may want to download the uncompressed database directly.

Is anyone brave enough to write a small script to merge two databases?

BTW I was writing a simple and dirty tool to migrate old magneticod (Python) database data to the new version (including PostgreSQL and any other supported engine). It's not optimized yet and uses too much memory (it loads all torrents from the database at once).

But if you're not scared of bad and not optimized code you can try to use it.

UPD: I just pushed a small README.md update and added the ability to use the not-yet-merged postgres and beanstalk engines, as well as upstream's sqlite3 and stdout.

Here it is: https://framagit.org/Glandos/magnetico_merge
It's hosted on another GitLab instance, but you can log in with your GitHub account if you want to contribute. If you want to fork it on GitHub, please let me know so I can follow your improvements :)

@Glandos It's worth mentioning that it only works with SQLite databases.

Yes indeed, but it is the only database currently supported :) And the only kind of database that has been shared. Sharing a PostgreSQL database for merging is not complex, but it is different.

Merging Magneticod bootstrap 2019-10-14/database.sqlite3 into database.sqlite3
Gathering database statistics: 4835749 torrents to merge.
  [######################################################]  4836000/4835749
Comitting… OK. 4835749 torrents processed. 2820832 torrents were not merged due to errors.

Here it is. I now have a big merged database containing both of your databases: more than 7 million torrents with more than 216 million file entries.
I will share this database when I have time.

7 million torrents

Did you not remove duplicates? I would guess that there would be significant overlap between the databases.

There is a lot of overlap, as you can see in the merge report: 2820832 torrents were not merged due to errors. The message is not clear, but it usually means that a constraint was violated and the insert was skipped.

Did you not remove duplicates?

What do you call a duplicate? Torrents with the same infohash couldn't be inserted again.

Here is mine: https://antipoul.fr/dhtd/ This is very basic.
It includes databases from https://tnt.maiti.info/dhtd/ and https://kescher.at/magneticod-db at the time of writing.
It is huge (21GB after decompression), but it works on my Atom D2550, so it should work anywhere.

@anindyamaiti Thanks for your regular updates. Your page is very nice. Do you think you can add an RSS feed? I know it's another thing to do :)

Pinned! I think once we implement import & export functionality, it'd be even easier (and portable across different databases). =)

Closing because it's not an issue but feel free to keep the discussion & sharing going.

Do you think you can add an RSS feed?

@Glandos I was thinking of the same. Here is a basic automated RSS feed of the ten most recently added files: https://tnt.maiti.info/dhtd/rss.php

Nothing fancy, just the filenames, but it should be good enough for a notification.

If anyone else is interested in incorporating RSS for their shares, here is my (dirty) PHP code: https://tnt.maiti.info/dhtd/rss.php.txt

Pinned!

@boramalper thanks for the pin! 😊

@boramalper suggested I point to torrents.csv, an open repository of torrents / global search engine. Here's the issue for potentially adding people's data to this.

I have a new dump of my own: https://antipoul.fr/dhtd/20200203_9.2M_magnetico-merge.torrent

Since my server is really low on CPU, I didn't use XZ and switched to zstandard. The output is larger, but much faster to compress / decompress.

No tracker inside, so I will be the first peer in the swarm.
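
If anyone prefers to script the compression, here is a minimal sketch using the third-party zstandard Python package (the level is just an example; the zstd command-line tool works just as well):

```python
import zstandard

# Stream-compress the SQLite dump with zstd; much lighter on the CPU than xz.
cctx = zstandard.ZstdCompressor(level=10)  # level chosen arbitrarily for illustration
with open("database.sqlite3", "rb") as src, \
        open("database.sqlite3.zst", "wb") as dst:
    cctx.copy_stream(src, dst)
```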

I, too, have released a new dump on https://kescher.at/magneticod-db.
It has around 6.4 million torrents in it.
It uses zstandard compression from now on as well.

Obviously not relying on trackers either, just the DHT.

In case your client allows manually adding peers and doesn't seem to find a connection, feel free to add 185.21.216.171:55451 as a peer.

Also, I may seed other dumps here in order to increase availability for people who want to bootstrap their db, which is why I call my torrents "Magneticod bootstrap".

I have released a new dump, having roughly 10.8 million torrents.

You can get it here.

If you can't find any peers through DHT, add 185.21.216.171:55451 as peer if your client allows it.

@kescherCode could you update your dump? I'd offer to host it as a direct download on one of my servers.

I considered sharing my version but figured your public magnetico instance has over 11.3 million torrents now which makes my 8 million look rather pale in comparison.

@19h I will soon create a new torrent, yes. However, do feel free to share your dump as well, as dumps can be combined together to create a bigger database. Some people find torrents my instance can't find ;)

If you're interested, I can share a dump of an instance which uses PostgreSQL to store the collected data.

magneticod=# SELECT COUNT(id) FROM magneticod.torrents;
  count   
----------
 14546639

@kescherCode I tar-balled my dump for you here: https://r1.darknet.dev/magnetico-20200705-22b4c048c924e825d147a3fce7cb43f826fae221.tar (24'993'610 KB ~ 24G; the server has an unmetered 10Gbit uplink, go wild).

@skobkin I'd love to have that! If we merge all our dumps, we may get a new super database. Would be exciting!

@skobkin I'll be glad to try to merge a dump from your Postgres instance. I wrote the merger in Python, and it can only read SQLite for now. I guess that importing the dump into Postgres and then merging would be too heavy, so I'll try to read the dump directly. But for now, the output will still be sqlite3.

EDIT: if you can go with pg_dump in custom format, I'll be able to use https://github.com/gmr/pgdumplib

@19h I am currently merging your dump, thanks! But next time, you should at least compress it with gzip or zstd :)

@Glandos sorry about that. The server has more bandwidth than CPU.. :-)

@Glandos Dump is being made right now. But it's in plain format. I probably can make another custom dump later.

I've also been thinking that I could add an import/export feature to my magnetico-web. As I said in #197, the already-supported JSON Lines format would be one of the best ways to reach maximum interoperability between database backends and other implementations, even ones not working with magnetico itself.

I've already created an issue for myself :)

OK, my server also has a small CPU (Atom D2550), but here is my fresh dump, consolidated with all known databases to date: https://antipoul.fr/dhtd/20200706_13.6M_magnetico-merge.torrent

@Glandos Amazing! Thanks!

@Glandos I almost wrote to ask you about the possibility of implementing JSON export in your tool, to be able to convert SQLite databases to JSON files.

But then I realized that I already wrote a migration tool which uses magnetico's own database persistence layer to store data. So it should also work with your SQLite dumps, and if it doesn't, it should be easy to fix, because there is only a very slight difference between the old and new magnetico database schemas.

When I have time I'll try to download one of your dumps and check if it works with my migrator tool.

If it works, then we should also be able to export any (new) magnetico database to any of the supported formats: JSON (stdout), PostgreSQL (postgres, if using my fork) or any other backend that gets implemented in magnetico.

In the end I should be able to convert your SQLite dumps to JSON using the stdout driver and then import them into my instance. So I'll be able to benefit from these community dumps too :)
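
For illustration, here is roughly what a JSON Lines export generated straight from one of the SQLite dumps could look like; the field names are only an example, not necessarily the exact format my tools emit:

```python
import json
import sqlite3

# One JSON object per torrent, one torrent per line (JSON Lines).
# Column names assume magneticod's upstream SQLite schema.
db = sqlite3.connect("database.sqlite3")
with open("torrents.jsonl", "w", encoding="utf-8") as out:
    for torrent_id, info_hash, name, total_size, discovered_on in db.execute(
        "SELECT id, info_hash, name, total_size, discovered_on FROM torrents"
    ):
        files = [
            {"size": size, "path": path}
            for size, path in db.execute(
                "SELECT size, path FROM files WHERE torrent_id = ?", (torrent_id,)
            )
        ]
        record = {
            "infoHash": info_hash.hex(),
            "name": name,
            "totalSize": total_size,
            "discoveredOn": discovered_on,
            "files": files,
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
db.close()
```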

@Glandos I can't find any peers to download from -- can you provide me with a direct download link? I'll make sure it's seeded.

@Glandos I remembered yesterday that I have backups of that database in pg_dump custom format. So here it is:
https://mega.nz/file/U15QSCpD#DzCfMNQNRJX21vkGb6gbcAMWLf6ZFmg4ej7JsMXDsAc

Let me know when I can delete it.

@Glandos Your torrent has no peers.

Also, can you take a look at the merge requests for magnetico_merge? I made two a while ago.

@19h Small hint for the future: If you want to share your sqlite3 file, be sure to manually open it with sqlite3 first and execute PRAGMA wal_checkpoint(TRUNCATE);. That way, the -shm and -wal files are written into the main database and deleted afterwards.
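
The same checkpoint can be done from Python, if that is more convenient:

```python
import sqlite3

# Flush the -wal/-shm sidecar files into the main database file before sharing it.
db = sqlite3.connect("database.sqlite3")
db.execute("PRAGMA wal_checkpoint(TRUNCATE);")
db.close()
```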

@19h and @kescherCode I recreated the torrent with announce URIs in it. My client announced it, so now, you should be able to find me. But you need to reimport the torrent (from https://antipoul.fr/dhtd/20200706_13.6M_magnetico-merge.torrent) as I've updated it.

@skobkin I can't download from MEGA because the file is too big and it requires me to install extra software. I won't do this, sorry :)

@Glandos I'm removing it then 🤷

My latest dump, containing 13.8 million torrents.
Magnet link available separately here

I will make sure this file is well-seeded by a fast connection as well as my home connection.

@19h see updated dump above.

@Glandos @kescherCode that's amazing, thanks both of you!

@Glandos @kescherCode jfyi I fetched both dumps and my seedbox is seeding them.

I'm seeding your torrents.

Also ... I'm currently writing a merging tool in Rust so that it's a bit faster, but my ideal future for this would be migrating off SQLite to LevelDB (or the Facebook fork, RocksDB). I'm also playing with the idea of building a frontend that searches the database using tantivy, but that's a bit of a stretch goal ..

It would be cool if we could have a semi-DHT where we can interconnect our instances so that they act as isolated satellites for each other..

How do I install this project on a VPS?

@sunnymme this isn't the right place for this question. Check the readme, check other issues or create one ..

Here is the dump of my database. 2.64M torrents. My database is not merged with any other database.
Compressed zst file is 2.5GB. The sqlite3 file is 8.3GB.
You can find my database at https://dyonr.nl/magnetico/
Preferably, use the .torrent to download it instead of downloading the zst file.
The torrent is loaded on my seedbox (1Gbit/s), the .zst on my server which is limited to 200Mbit/s.

@DyonR I'm seeding your database now; I won't merge it in until the next time I make a dump.

Here is my fresher dump: https://antipoul.fr/dhtd/20201112_14.1_magnetico-merge.torrent

Unfortunately, it seems to be stalling a bit. Sometimes I get 0.1 torrents per second, but it is usually 10 times less…

Since some of you have millions of torrents, maybe you are interested in adding support for other databases that scale better than SQLite. Some users are having request timeouts in magneticow due to poor SQLite performance.

I don't have time to work on this issue, but maybe some of you do. Jackett/Jackett#10174 (comment)

UPDATE: Of course, having a faster backend will increase the discovery/indexing speed too. There is an attempt to include Postgres, but I think it's abandoned: #214

@ngosang It's not abandoned; it's been working for me for more than a year now 😄

I just forgot about it because Bora didn't answer my question. I think I can make the last change he asked for soon, but I'm not sure he'll merge it because he hasn't been maintaining magnetico for a long time.

UPD: You can test it using this Docker image: https://hub.docker.com/r/skobkin/magneticod

@ngosang commented on 15 Nov 2020 at 13:01 UTC+1:

UPDATE: Of course, having a faster backend will increase the discovery/indexing speed too.

Since magneticod is not using 100% of a CPU, I don't think this is the current bottleneck.

@Glandos SQLite really is a bottleneck sometimes. It won't use 100% of the CPU because it's most likely using 100% of the disk.
You can probably tweak SQLite when initializing the client to use very big caches and so on, but I'm not sure it will outperform MySQL or PostgreSQL.

I don't have time to check it, so I may be wrong. If someone can check the disk usage (IOPS, throughput, latency) when searching torrents in a VERY LARGE database, let us know.
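
For anyone who wants to experiment, these are the kinds of tweaks I mean; the values below are only examples, so measure before and after:

```python
import sqlite3

# Illustrative read-performance settings for a large magnetico SQLite database.
db = sqlite3.connect("database.sqlite3")
db.execute("PRAGMA journal_mode=WAL;")      # readers don't block the writer
db.execute("PRAGMA cache_size=-1048576;")   # ~1 GiB page cache (negative value = KiB)
db.execute("PRAGMA mmap_size=8589934592;")  # memory-map up to 8 GiB of the file
db.execute("PRAGMA temp_store=MEMORY;")     # keep temporary tables/indices in RAM
db.close()
```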

From my experience as a software architect, if you have a 10 GB database, SQLite read performance is between 100 and 1000 times slower than other relational databases like MySQL, Postgres, or Oracle.
If the entire database does not fit in memory, then all databases have to read from disk at some point. The difference is that SQLite does not keep several levels of in-memory cache (table indexes, most common queries, etc.), so with each query you have to read much more data from disk than with other databases. I saw 32 GB exports in this thread. You should notice an amazing improvement in both indexing and search.

BTW, I've just updated the PR with PostgreSQL, eliminating the last "problem" that was pointed out a year ago.

It was merged!

@skobkin Now, how do I migrate my data from SQLite to Postgres? lol

@kescherCode See this comment.

Be aware that magneticow does not work with PostgreSQL as of now.