boramalper / magnetico

Autonomous (self-hosted) BitTorrent DHT search engine suite.

Home Page: http://labs.boramalper.org/magnetico/

Community Database Dump

anindyamaiti opened this issue

I started fresh last month and have ~1.7M torrents in my database after about 3 weeks. I plan to keep my magneticod running for the foreseeable future, expecting to add roughly 50K torrents per day once the initial spike levels off.

There have been requests for a database dump before, but to my knowledge no one has shared theirs. So I thought I'd take the initiative. Here is my website, where I will share (via .torrent) my database dump 1-2 times a month: https://tnt.maiti.info/dhtd/

You can use it as-is or to get a head start with magneticod. And, don't forget to seed!

Huge thanks to @boramalper for making this project happen.

I will soon create a page under https://kescher.at/magneticod-db for sharing my own SQLite database backups.

Thanks @kescherCode. I suggest you take the same approach of sharing via .torrent.

As far as GitHub is concerned, I am confident that sharing an external webpage that contains .torrent files of self-created databases is not in violation of any rules. Neither the webpage posted here nor the torrent/database contains any copyrighted material.

That's good news.

Is anyone brave enough to write a small script to merge two databases?

That's good news.

Is anyone brave enough to write a small script to merge two databases?

https://github.com/kd8bny/sqlMerge

Alternatively:
https://stackoverflow.com/a/37138506/11519866

I have put up a page at https://kescher.at/magneticod-db now :)

That's good news.
Is anyone brave enough to write a small script to merge two databases?

https://github.com/kd8bny/sqlMerge

Alternatively:
https://stackoverflow.com/a/37138506/11519866

There is a real issue here.
First, the sqlMerge repository doesn't work even after merging kd8bny/sqlMerge#3 and kd8bny/sqlMerge#4, since it relies on str(row), which doesn't work on BLOB columns.

Then, I explored manual merging. Obviously, torrents.id needs to be modified, but the issue is that there is a foreign key from files.torrent_id to torrents.id, so the script should:

  • Read a record from merged_db
  • Insert it into original_db and record the insertion id
  • Modify merged_db.files.torrent_id according to the new id, and merge the merged_db.files rows.

This is not so difficult, but I don't think it can be done using a generic script.
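
A rough sketch of what I have in mind, assuming magneticod's usual SQLite schema (torrents(id, info_hash, name, total_size, discovered_on) with a UNIQUE infohash, and files(torrent_id, size, path)); torrents whose infohash already exists in the destination are simply skipped, adjust the column lists if your schema differs:

```python
import sqlite3

# merged_db is merged *into* original_db, remapping files.torrent_id as it goes.
dst = sqlite3.connect("original_db.sqlite3")
src = sqlite3.connect("merged_db.sqlite3")

for old_id, info_hash, name, total_size, discovered_on in src.execute(
    "SELECT id, info_hash, name, total_size, discovered_on FROM torrents"
):
    # Skip torrents already present in the destination (same infohash).
    if dst.execute("SELECT 1 FROM torrents WHERE info_hash = ?", (info_hash,)).fetchone():
        continue
    cur = dst.execute(
        "INSERT INTO torrents (info_hash, name, total_size, discovered_on) VALUES (?, ?, ?, ?)",
        (info_hash, name, total_size, discovered_on),
    )
    new_id = cur.lastrowid  # id assigned in the destination database
    # Re-attach the file rows to the new torrent id.
    dst.executemany(
        "INSERT INTO files (torrent_id, size, path) VALUES (?, ?, ?)",
        [
            (new_id, size, path)
            for size, path in src.execute(
                "SELECT size, path FROM files WHERE torrent_id = ?", (old_id,)
            )
        ],
    )

dst.commit()
src.close()
dst.close()
```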

It may be ideal to compress the database before sharing.

Compressing with LZMA2 (7zip) on the 'fastest' preset (64K dictionary, 32 word size) yields a file 23.3% of the size of the uncompressed database.

Compressing on the 'normal' preset (16M dictionary, 32 word size) yields a file 15-18.2% of the size of the uncompressed database.

I'd suggest using the following command on Linux systems with xz-utils:

xz -7vk --threads=12 database.sqlite3

-7 is the compression level (out of 9)
-v is for 'verbose'; it shows the progress of the compression in your TTY
-k is for 'keep', meaning it won't delete your database after the compression is done
--threads=12 tells xz to use 12 threads; you can use --threads=0 to use as many threads as there are CPUs on your server.

This will produce a file named database.sqlite3.xz

It can then be decompressed after being downloaded using unxz -v database.sqlite3.xz (or -vk if you want to keep the archive).
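
For completeness, the same compression can also be scripted with Python's standard lzma module (preset 7 here mirrors xz -7); this is only a sketch, the xz command above does the same job:

```python
import lzma
import shutil

# Compress database.sqlite3 to database.sqlite3.xz, roughly equivalent to `xz -7k`.
with open("database.sqlite3", "rb") as src, \
        lzma.open("database.sqlite3.xz", "wb", preset=7) as dst:
    shutil.copyfileobj(src, dst)
```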

I've taken the liberty of compressing both of the shared databases so far: https://public.tool.nz/DHT/

@AlphaDelta I agree, and I will compress my future torrents containing a database.sqlite3 with xz, however probably with the most aggressive preset you've ever seen:

xz --lzma2=dict=1536Mi,nice=273

@kescherCode That is indeed the most aggressive preset I've ever seen 😂

Just keep in mind the entire dictionary is loaded into memory when decompressing, so it would require allocating 1536MiB of memory just to decompress the database.

Probably not worth riding the exponential-cost train that far.

That's a very significant size reduction using xz!

I will share both compressed and uncompressed versions from next time. Those who have a low-memory VPS for hosting magneticow may want to download the uncompressed database directly.

Is anyone brave enough to write a small script to merge two databases?

BTW I was writing a simple and dirty tool to migrate old magneticod (Python) database data to the new version (including PostgreSQL and any other supported engine). It's not optimized yet and uses too much memory (it loads all torrents from the database at once).

But if you're not scared of bad and not optimized code you can try to use it.

UPD: I just pushed a small README.md update and added the ability to use the not-yet-merged postgres and beanstalk engines, as well as upstream's sqlite3 and stdout.

Here it is: https://framagit.org/Glandos/magnetico_merge
It's hosted on another GitLab instance, but you can log in with your GitHub account if you want to contribute. If you want to fork it on GitHub, please let me know so I can follow your improvements :)

@Glandos It's worth mentioning that it only works with SQLite databases.

Yes indeed, but it is the only database currently supported :) And the only kind of database that has been shared. Sharing a PostgreSQL database for merging is not complex, but it is different.

Merging Magneticod bootstrap 2019-10-14/database.sqlite3 into database.sqlite3
Gathering database statistics: 4835749 torrents to merge.
  [######################################################]  4836000/4835749
Comitting… OK. 4835749 torrents processed. 2820832 torrents were not merged due to errors.

Here it is. I now have a big merged database containing both of your databases: more than 7 million torrents with more than 216 million file entries.
I will share this database when I have time.

7 million torrents

Did you not remove duplicates? I would guess that there would be significant overlap between the databases.

There is a lot of overlap, as you can see in the merge report: 2820832 torrents were not merged due to errors. The message is not clear, but it usually means that a constraint was violated and the insert was skipped.

Did you not remove duplicates?

What do you call a duplicate? Torrents with the same infohash couldn't be inserted again.

Here is mine: https://antipoul.fr/dhtd/ This is very basic.
It includes databases from https://tnt.maiti.info/dhtd/ and https://kescher.at/magneticod-db at the time of writing.
It is huge (21GB after decompression), but it works on my Atom D2550, so it should work anywhere.

@anindyamaiti Thanks for your regular updates. Your page is very nice. Do you think you can add an RSS feed? I know it's another thing to do :)

Pinned! I think once we implement import & export functionality, it'd be even easier (and portable across different databases). =)

Closing because it's not an issue but feel free to keep the discussion & sharing going.

Do you think you can add an RSS feed?

@Glandos I was thinking of the same. Here is a basic automated RSS feed of the ten most recently added files: https://tnt.maiti.info/dhtd/rss.php

Nothing fancy, just the filenames, but it should be good enough for a notification.

If anyone else is interested in incorporating RSS for their shares, here is my (dirty) PHP code: https://tnt.maiti.info/dhtd/rss.php.txt

Pinned!

@boramalper thanks for the pin! 😊

@boramalper suggested I point to torrents.csv, an open repository of torrents / global search engine. Here's the issue for potentially adding people's data to this.

I have a new dump of my own: https://antipoul.fr/dhtd/20200203_9.2M_magnetico-merge.torrent

Since my server is really low on CPU, I didn't use XZ and switched to zstandard. The output is larger, but much faster to compress / decompress.

No tracker inside, so I will be the first peer in the swarm.
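
If anyone prefers to script the compression, here is a minimal sketch using the third-party zstandard Python package (the level is just an example; the zstd command-line tool works just as well):

```python
import zstandard

# Stream-compress the SQLite dump with zstd; much lighter on the CPU than xz.
cctx = zstandard.ZstdCompressor(level=10)  # level chosen arbitrarily for illustration
with open("database.sqlite3", "rb") as src, \
        open("database.sqlite3.zst", "wb") as dst:
    cctx.copy_stream(src, dst)
```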

I, too, have released a new dump on https://kescher.at/magneticod-db.
It has around 6.4 million torrents in it.
It uses zstandard compression from now on as well.

Obviously not relying on trackers either, just the DHT.

In case your client allows manually adding peers and doesn't seem to find a connection, feel free to add 185.21.216.171:55451 as a peer.

Also, I may seed other dumps here in order to increase availability for people who want to bootstrap their db, which is why I call my torrents "Magneticod bootstrap".

I have released a new dump, having roughly 10.8 million torrents.

You can get it here.

If you can't find any peers through DHT, add 185.21.216.171:55451 as peer if your client allows it.

@kescherCode could you update your dump? I'd offer to host it as a direct download on one of my servers.

I considered sharing my version but figured your public magnetico instance has over 11.3 million torrents now which makes my 8 million look rather pale in comparison.

@19h I will soon create a new torrent, yes. However, do feel free to share your dump as well, as dumps can be combined together to create a bigger database. Some people find torrents my instance can't find ;)

If you're interested, I can share a dump of an instance which uses PostgreSQL to store the collected data.

magneticod=# SELECT COUNT(id) FROM magneticod.torrents;
  count   
----------
 14546639

@kescherCode I tar-balled my dump for you here: https://r1.darknet.dev/magnetico-20200705-22b4c048c924e825d147a3fce7cb43f826fae221.tar (24'993'610 KB ~ 24G; the server has an unmetered 10Gbit uplink, go wild).

@skobkin I'd love to have that! If we merge all our dumps, we may get a new super database. Would be exciting!

@skobkin I'll be glad to try to merge a dump from your Postgres instance. I wrote the merger in Python, and it can only read SQLite for now. I guess that importing the dump into Postgres and then merging would be too heavy, so I'll try to read the dump directly. But for now, the output will still be sqlite3.

EDIT: if you can go with pg_dump in custom format, I'll be able to use https://github.com/gmr/pgdumplib

@19h I am currently merging your dump, thanks! But next time, you should at least compress it with gzip or zstd :)

@Glandos sorry about that. The server has more bandwidth than CPU.. :-)

@Glandos Dump is being made right now. But it's in plain format. I probably can make another custom dump later.

I've also been thinking that I could add an import/export feature to my magnetico-web. As I said in #197, the already-supported JSON Lines format would be one of the best ways to reach maximum interoperability between database backends and other implementations, even ones not working with magnetico itself.

I've already created an issue for myself :)

OK, my server also has a small CPU (Atom D2550), but here is my fresh dump, consolidated with all known databases to date: https://antipoul.fr/dhtd/20200706_13.6M_magnetico-merge.torrent

@Glandos Amazing! Thanks!

@Glandos I almost wrote to ask you about the possibility of implementing JSON export in your tool, to be able to convert SQLite databases to JSON files.

But then I realized that I already wrote a migration tool which uses magnetico's own database persistence layer to store data. So it should also work with your SQLite dumps, and if it doesn't, it should be easy to fix, because there is only a very slight difference between the old and new magnetico database schemas.

When I have time I'll try to download one of your dumps and check if it works with my migrator tool.

If it works, then we should also be able to export any (new) magnetico database to any of the supported formats: JSON (stdout), PostgreSQL (postgres, if using my fork) or any other backend that gets implemented in magnetico.

In the end I should be able to convert your SQLite dumps to JSON using the stdout driver and then import them into my instance. So I'll be able to benefit from these community dumps too :)
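
For illustration, here is roughly what a JSON Lines export generated straight from one of the SQLite dumps could look like; the field names are only an example, not necessarily the exact format my tools emit:

```python
import json
import sqlite3

# One JSON object per torrent, one torrent per line (JSON Lines).
# Column names assume magneticod's upstream SQLite schema.
db = sqlite3.connect("database.sqlite3")
with open("torrents.jsonl", "w", encoding="utf-8") as out:
    for torrent_id, info_hash, name, total_size, discovered_on in db.execute(
        "SELECT id, info_hash, name, total_size, discovered_on FROM torrents"
    ):
        files = [
            {"size": size, "path": path}
            for size, path in db.execute(
                "SELECT size, path FROM files WHERE torrent_id = ?", (torrent_id,)
            )
        ]
        record = {
            "infoHash": info_hash.hex(),
            "name": name,
            "totalSize": total_size,
            "discoveredOn": discovered_on,
            "files": files,
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
db.close()
```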

@Glandos I can't find any peers to download from -- can you provide me with a direct download link? I'll make sure it's seeded.

@Glandos I remembered yesterday that I have backups of that database in pg_dump custom format. So here it is:
https://mega.nz/file/U15QSCpD#DzCfMNQNRJX21vkGb6gbcAMWLf6ZFmg4ej7JsMXDsAc

Let me know when I can delete it.

@Glandos Your torrent has no peers.

Also, can you take a look at the merge requests for magnetico_merge? I made two a while ago.

@19h Small hint for the future: If you want to share your sqlite3 file, be sure to manually open it with sqlite3 first and execute PRAGMA wal_checkpoint(TRUNCATE);. That way, the -shm and -wal files are written into the main database and deleted afterwards.
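
The same checkpoint can be done from Python, if that is more convenient:

```python
import sqlite3

# Flush the -wal/-shm sidecar files into the main database file before sharing it.
db = sqlite3.connect("database.sqlite3")
db.execute("PRAGMA wal_checkpoint(TRUNCATE);")
db.close()
```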

@19h and @kescherCode I recreated the torrent with announce URIs in it. My client announced it, so now, you should be able to find me. But you need to reimport the torrent (from https://antipoul.fr/dhtd/20200706_13.6M_magnetico-merge.torrent) as I've updated it.

@skobkin I can't download from MEGA because the file is too big and it requires me to install extra software. I won't do this, sorry :)

@Glandos I'm removing it then 🤷

My latest dump, containing 13.8 million torrents.
Magnet link available separately here

I will make sure this file is well-seeded by a fast connection as well as my home connection.

@19h see updated dump above.

@Glandos @kescherCode that's amazing, thanks both of you!

@Glandos @kescherCode jfyi I fetched both dumps and my seedbox is seeding them.

I'm seeding your torrents.

Also ... I'm currently writing a merging tool in Rust so that it's a bit faster, but my ideal future for this would be migrating off SQLite to LevelDB (or the Facebook fork, RocksDB). I'm also playing with the idea of building a frontend that searches the database using tantivy, but that's a bit of a stretch goal ..

It would be cool if we could have a semi-DHT where we can interconnect our instances so that they act as isolated satellites for each other..

How do I install this project on a VPS?

@sunnymme this isn't the right place for this question. Check the readme, check other issues or create one ..

Here is the dump of my database. 2.64M torrents. My database is not merged with any other database.
Compressed zst file is 2.5GB. The sqlite3 file is 8.3GB.
You can find my database at https://dyonr.nl/magnetico/
Preferably, use the .torrent to download it instead of downloading the zst file.
The torrent is loaded on my seedbox (1Gbit/s), the .zst on my server which is limited to 200Mbit/s.

@DyonR I'm seeding your database now; I won't merge it in until the next time I make a dump.

Here is my fresher dump: https://antipoul.fr/dhtd/20201112_14.1_magnetico-merge.torrent

Unfortunately, it seems to be stalling a bit. Sometimes I get 0.1 torrents per second, but it is usually 10 times less…

Since some of you have millions of torrents, maybe you are interested in adding support for other databases that scale better than SQLite. Some users are having request timeouts in magneticow due to poor SQLite performance.

I don't have time to work on this issue, but maybe some of you do. Jackett/Jackett#10174 (comment)

UPDATE: Of course, having a faster backend will increase the discovery/indexing speed too. There is an attempt to include Postgres, but I think it's abandoned: #214

@ngosang It's not abandoned; it's been working for me for more than a year now 😄

I just forgot about it because Bora didn't answer my question. I think I can make the last change he asked for soon, but I'm not sure he'll merge it because he hasn't been maintaining magnetico for a long time.

UPD: You can test it using this Docker image: https://hub.docker.com/r/skobkin/magneticod

@ngosang commented on 15 Nov 2020 at 13:01 UTC+1:

UPDATE: Of course, having a faster backend will increase the discovery/indexing speed too.

Since magneticod is not using 100% of a CPU, I don't think this is the current bottleneck.

@Glandos SQLite really is a bottleneck sometimes. It won't use 100% of the CPU because it's most likely using 100% of the disk.
You can probably tweak SQLite when initializing the client to use very big caches and so on, but I'm not sure it will outperform MySQL or PostgreSQL.

I don't have time to check it, so I may be wrong. If someone can check the disk usage (IOPS, throughput, latency) when searching torrents in a VERY LARGE database, let us know.
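
For anyone who wants to experiment, these are the kinds of tweaks I mean; the values below are only examples, so measure before and after:

```python
import sqlite3

# Illustrative read-performance settings for a large magnetico SQLite database.
db = sqlite3.connect("database.sqlite3")
db.execute("PRAGMA journal_mode=WAL;")      # readers don't block the writer
db.execute("PRAGMA cache_size=-1048576;")   # ~1 GiB page cache (negative value = KiB)
db.execute("PRAGMA mmap_size=8589934592;")  # memory-map up to 8 GiB of the file
db.execute("PRAGMA temp_store=MEMORY;")     # keep temporary tables/indices in RAM
db.close()
```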

From my experience as a software architect, if you have a 10 GB database, SQLite read performance is between 100 and 1000 times slower than other relational databases like MySQL, Postgres, or Oracle.
If the entire database does not fit in memory, then all databases have to read from disk at some point. The difference is that SQLite does not keep several levels of in-memory cache (table indexes, most common queries, etc.), so with each query you have to read much more data from disk than with other databases. I saw 32 GB exports in this thread. You should notice an amazing improvement in both indexing and search.

BTW, I've just updated the PR with PostgreSQL, eliminating the last "problem" that was pointed out a year ago.

It was merged!

@skobkin Now, how do I migrate my data from SQLite to Postgres? lol

@kescherCode See this comment.

Be aware that magneticow does not work with PostgreSQL as of now.