trixi-framework / HOHQMesh

High Order Hex-Quad Mesh (HOHQMesh) package to automatically generate all-quadrilateral meshes with high order boundary information.

Home Page: https://trixi-framework.github.io/HOHQMesh

Tracking too many large files

andrewwinters5000 opened this issue · comments

The repo tracks several mesh and plotting .tec files (in either the Benchmarks or Examples folders), causing it to grow in size over time. Here is a list of the largest files tracked by the repo:

19M ./Benchmarks/PlotFiles/Benchmarks/SeaMount.tec
14M ./Benchmarks/PlotFiles/Benchmarks/SeaMountCubic.tec
13M ./Examples/3D/Pond/Pond3D.tec
6.5M ./Examples/3D/ScaledCylinder/CylinderScale.mesh
5.4M ./Examples/3D/ScaledCylinder/CylinderScale.tec
5.1M ./Examples/3D/HalfCircleRotated/HalfCircle3DR.mesh
5.0M ./Examples/3D/HalfCircleRotated/HalfCircle3DR.tec
5.0M ./Benchmarks/MeshFiles/Benchmarks/SeaMount.mesh
4.4M ./Examples/3D/CavityRampExtruded/CavityRamp3D.tec
4.0M ./Benchmarks/MeshFiles/Benchmarks/SeaMountCubic.mesh
3.7M ./Examples/3D/Pond/Pond3D.mesh
3.4M ./Examples/3D/HalfCircleExtruded/HalfCircle3D.tec
3.0M ./Examples/3D/MtStHelens/sthelens_grid_data.txt
2.4M ./Examples/2D/LakeSuperior/SuperiorMain.tec
1.8M ./Examples/3D/BoxRotated/Box3DRot.mesh
1.8M ./Examples/3D/Box/Box3D.mesh
1.8M ./Examples/2D/GingerbreadMan/GingerbreadManPlot.tec
1.7M ./Examples/2D/EastCoastUS/EastCoastUS2d.tec
1.5M ./Examples/3D/CavityRampExtruded/CavityRamp3D.mesh
1.5M ./Examples/3D/BoxRotated/Box3D.tec
1.4M ./Examples/3D/BottomFromFile/BottomFromFile.tec
1.2M ./Examples/2D/IndianOcean/IndianOcean.mesh
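A listing like the one above can be reproduced roughly as follows. This is a minimal sketch, not necessarily the command that produced the list; the function name `largest_files` is made up for illustration, and sizes are printed in KiB (as reported by `du -k`) rather than in the human-readable form shown above.

```shell
# Print the largest files below a directory, biggest first.
# $1: directory to scan (default: current directory)
# $2: number of entries to print (default: 20)
# Note: `largest_files` is a hypothetical helper, not part of HOHQMesh.
largest_files() {
  dir="${1:-.}"
  count="${2:-20}"
  # Skip git's own object store, report disk usage in KiB,
  # then sort numerically in descending order.
  find "$dir" -type f ! -path '*/.git/*' -exec du -k {} + |
    sort -rn |
    head -n "$count"
}
```

Called as `largest_files . 22` from the repository root, this lists working-tree files only. Files that were deleted but still live in the history do not show up.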

As more examples are added with future development, the size of the repo will become quite large and unwieldy to work with. For instance, in my recent attempt to build a new tarball and release a new version tag, Utilities/createrelease created a HOHQMesh-v1.3.0.tar.gz that was 97.5M. This is too large to attach to a GitHub release.

🙀🙀🙀

To be honest, I'd suggest to remove all examples larger than a given size and/or move them to a separate repository. I would also suggest to remove all tec files since they blow up the repo size disproportionately. Then, we should bite the bullet and rewrite repo history using the BFG repo cleaner, but only since the last release (which is bad enough).
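Before rewriting history, it helps to know which blobs actually dominate the object store, since deleted files still count. The following is a minimal sketch using standard git plumbing; the function name `largest_blobs` is an assumption for illustration, and it must be run inside the repository.

```shell
# List the N largest blobs reachable from any ref as "<size-in-bytes> <path>".
# $1: number of entries to print (default: 20)
# Note: `largest_blobs` is a hypothetical helper, not part of HOHQMesh.
largest_blobs() {
  count="${1:-20}"
  # rev-list prints "<sha> <path>" for every object in the history;
  # cat-file adds the type and size, and %(rest) carries the path along.
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |
    sort -rn |
    head -n "$count"
}
```

The paths it reports are candidates to hand to a history-rewriting tool such as the BFG repo cleaner mentioned above.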

Would it make sense to just put these files into a separate git repo? with "examples" etc?

I agree with Michael and maybe would go further. The examples are mostly for convenience and are not used for anything in particular, like in the testing. So all the plot and mesh files can be untracked, I think. The user can run the control files and get the meshes themselves. They are all real fast. I never thought about the size of the repo. After all, the Slack app on my machine is 434MB all by itself. That's not to say I'm against the idea of a separate repo of plot files that the user can download individually if desired. Plus it might be nice to have the reference values.

A separate git repo or blob storage (e.g. Google Cloud Storage) would work well for storing tecplot and other files that can be used to verify results. If it's less than a few GB, I'm fine hosting them on GCS with public links made available.

Most of the 2D meshes and plot files are smallish, say less than 4MB. The 3D plot files can be the real killer. For example, the Mt. St. Helens SEM formatted plot file is approximately 200MB. Altogether I think we can safely assume (for the time being) that the total is less than a few GB.

Great. I'll get something up today.

Sounds good, I have untracked everything in PR #47 but all the files still live on main for the time being.
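For readers unfamiliar with untracking, the general idea can be sketched as below. This is not the content of PR #47, which is not shown in this thread. The file patterns and the function name `untrack_generated` are assumptions for illustration, and the function is meant to be run from the repository root.

```shell
# Stop tracking generated files while keeping the working copies on disk,
# and ignore them from now on.
# Note: `untrack_generated` and the pattern list are hypothetical.
untrack_generated() {
  for pattern in '*.tec' '*.mesh'; do
    # --cached removes the files from the index only; the quoted pattern
    # is expanded by git itself, so it matches in every subdirectory.
    git rm -q --cached --ignore-unmatch -- "$pattern"
    printf '%s\n' "$pattern" >> .gitignore
  done
}
```

After committing the result, the files no longer appear in new commits, but earlier commits still contain them, so the size of .git does not shrink.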

Not wanting to crash the party, but why not use another git repo, e.g., HOHQMesh_examples? The issue I see with using GCS is that it boils down to @fluidnumerics-joe having to handle the file management all by himself (uploading new files, deleting unused files, replacing changed files). Thus I suggest to at least give it one more try to find a solution that distributes the responsibility among a few more shoulders. Maybe @andrewwinters5000 has something like Sciebo, where you can upload up to 500 GiB of files and share access with others?

Having said that, if it is "just" example files that are not referenced on a per-file basis from the repo (i.e., not used during, e.g., CI testing), it might be OK to use some external storage that is managed by a single person, although I do not recommend it.

Now I am getting confused. Is the plan then to have a separate repo for the *.mesh, *.inp and *.tec files while we keep all the control files in the Examples or Benchmarks folders on HOHQMesh?

Not wanting to crash the party, but why not use another git repo, e.g., HOHQMesh_examples?

Nothing wrong with that approach either. I'm not sure what the cost is for git-lfs these days on GitHub though.

I can set up the project where more folks have access to manage files, if that's something we want to do. I'm not sure how Sciebo solves this silo'd management.

Another thing I just figured out is that Utilities/createrelease will include experimental control, mesh, or tec files that are untracked. That is, the check that the repository folder is clean is currently commented out:

# Ensure that there are no uncommited files/directories *except* the FTOL sources
printf "Ensure directory is clean and FTObjectLibrary is present... "
# if [[ "$(git status --ignored --porcelain)" != "!! Contrib/FTObjectLibrary/" ]]; then
#   echo "ERROR: directory is not clean or missing the FTObjectLibrary directory" >&2
#   git status --ignored --porcelain
#   exit 2
# fi
echo OK

This is exacerbating the problem.

For instance, if I make a clean clone of HOHQMesh and build a tarball, it is approximately 20 MB. So I still think that removing some of the large files (like SeaMount or Pond3D) from the .git is worthwhile, and untracking the mesh / tec files as in #47 is good for the future. Setting up a separate repo is still a good idea, but now that I have tracked down the cause of the large tarball (i.e. my own stupid user error) we should fix the createrelease script first.

I can set up the project where more folks have access to manage files, if that's something we want to do. I'm not sure how Sciebo solves this silo'd management.

That would sound good to me as well. In Trixi we just have a shared folder where the core team has write access and to which we give read access via link sharing. If something like this is possible with GCS, it would be a good solution I think.

One reason I think that the clean directory check was commented out is macOS and the .DS_Store files that it creates. These will always cause the directory to be seen as "unclean" by the check git status --ignored. I am currently investigating for a workaround.

I am currently investigating for a workaround.

Maybe something along the lines of

if [[ "$(git status --ignored --porcelain | grep -v '.DS_Store')" != "!! Contrib/FTObjectLibrary/" ]]; then

?

However, maybe you could also add here something like

# Delete .DS_Store files (relevant on macOS only)
printf "Delete .DS_Store files... "
find "$releasedir" -name .DS_Store -exec rm {} \;
echo "OK"

This still throws an error where DS_Store is reported, e.g.,

!! .DS_Store
!! Benchmarks/.DS_Store
...

and the creation fails.

No, I mean you need both: The first change will make sure that the check passes, the second one ensures that you do not include .DS_Store files in the release.

I added both, it is what gave me the error

Also, if we choose to ignore, e.g., *.tec, we need similar workarounds so that the createrelease command does not spuriously think that the current folder is "unclean".

I added both, it is what gave me the error

What is the output of git status --ignored --porcelain?

Also, if we choose to ignore, e.g., *.tec, we need similar workarounds so that the createrelease command does not spuriously think that the current folder is "unclean".

Isn't it the other way around? Since we do not want to include those files in our repo anymore, they also shouldn't be part of a release. The unclean check was originally intended to prevent exactly this - I don't know why I commented it out 🤷‍♂️

What is the output of git status --ignored --porcelain?

It is always a bit annoying editing / testing createrelease. Since it is tracked, making changes to it causes git status --porcelain to fire.

I tried adding the following to remove these file types from the tarball

# Delete mesh and tec plotting files
printf "Delete mesh and tec files... "
find "$releasedir" -name '*.mesh' -exec rm {} \;
find "$releasedir" -name '*.inp' -exec rm {} \;
find "$releasedir" -name '*.tec' -exec rm {} \;
echo "OK"

This shrinks it significantly, to around 2 MB.
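The three find calls can also be folded into one. The sketch below assumes the same three extensions and wraps the call in a made-up helper, `delete_generated_files`, that takes the directory as its argument. Quoting the patterns matters: an unquoted `*.mesh` is expanded by the shell whenever a matching file exists in the current directory, and find then only searches for that one name.

```shell
# Remove generated mesh and plot files below a directory.
# $1: directory to clean, e.g. the staging directory the script calls $releasedir
# Note: `delete_generated_files` is a hypothetical helper, not part of createrelease.
delete_generated_files() {
  # The patterns are quoted so that find, not the shell, expands them;
  # -exec ... + passes many files to a single rm invocation.
  find "$1" -type f \( -name '*.mesh' -o -name '*.inp' -o -name '*.tec' \) \
    -exec rm -f {} +
}
```

Used as `delete_generated_files "$releasedir"`, this removes every matching file in the release directory, including ones that are committed in the repository.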

The problem with adding those lines is that they delete also files that are currently committed in main. IMHO a release should usually contain all files in the repo, as otherwise one could ask why we have them in the repo at all? But this is my personal opinion and I wouldn't want to meddle with your release maintainer mojo 😬

IMHO a release should usually contain all files in the repo

I totally agree. But if we were to untrack these files as in #47 wouldn't this be okay? Then a user could have local copies of mesh and tecplot files as they use HOHQMesh but building the release would not include them.

What is the output of git status --ignored --porcelain?

I realized I didn't fully answer this. From main on my machine this is the output:

?? Examples/3D/MtStHelens/MtStHelens.inp
?? Examples/3D/MtStHelens/MtStHelens.tec
!! .DS_Store
!! Benchmarks/.DS_Store
!! Benchmarks/BenchmarkData/.DS_Store
!! Benchmarks/MeshFiles/.DS_Store
!! Benchmarks/MeshFiles/Benchmarks/.DS_Store
!! Benchmarks/MeshFiles/Tests/.DS_Store
!! Benchmarks/PlotFiles/.DS_Store
!! Benchmarks/PlotFiles/Benchmarks/.DS_Store
!! Benchmarks/PlotFiles/Tests/.DS_Store
!! Benchmarks/StatsFiles/.DS_Store
!! Benchmarks/StatsFiles/Tests/.DS_Store
!! Contrib/FTObjectLibrary/
!! Documentation/docs/authors.md
!! Documentation/docs/building-the-documentation.md
!! Documentation/docs/index.md
!! Documentation/docs/license.md
!! Examples/.DS_Store
!! Examples/2D/.DS_Store
!! Examples/3D/.DS_Store
!! Examples/3D/BottomFromFile/.DS_Store
!! Examples/3D/MtStHelens/.DS_Store
!! Examples/3D/Pond/.DS_Store
!! Examples/3D/Snake/.DS_Store
!! Source/.DS_Store
!! Source/Project/.DS_Store
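With output like this, comparing the whole (multi-line) status text against the single string "!! Contrib/FTObjectLibrary/" cannot succeed, even after .DS_Store lines are filtered out. One alternative is to drop the expected entries and check whether anything is left. The following is a minimal sketch of that idea, not the fix that was eventually applied; the function name `unexpected_entries` is made up for illustration.

```shell
# Read `git status --ignored --porcelain` output on stdin and print only the
# entries that are neither .DS_Store files nor the FTObjectLibrary directory.
# Note: `unexpected_entries` is a hypothetical helper, not part of createrelease.
unexpected_entries() {
  # grep exits non-zero when nothing is left over; `|| true` keeps that
  # from aborting scripts that run with `set -e`.
  grep -v -e '\.DS_Store$' -e '^!! Contrib/FTObjectLibrary/$' || true
}

# Possible use inside a release script (commented out so that this block
# only defines the helper):
#   leftover="$(git status --ignored --porcelain | unexpected_entries)"
#   if [ -n "$leftover" ] || [ ! -d Contrib/FTObjectLibrary ]; then
#     echo "ERROR: directory is not clean or missing the FTObjectLibrary directory" >&2
#     printf '%s\n' "$leftover" >&2
#     exit 2
#   fi
```

Applied to the listing above, the untracked MtStHelens files and the ignored Documentation/docs/*.md files would still be reported, so further exceptions would have to be added deliberately.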

But if we were to untrack these files as in #47 wouldn't this be okay?

Yes, if y'all decide that we will have absolutely no .inp/.tec/.mesh files in the repo, this would be ok.

Okay, I will just fix the annoying .DS_Store thing for the time being and prepare a new release for the Gaussian curvature feature. Now that I know the very large tarball was from my own user error this large file tracking is not as severe an issue as I originally thought.

FYI, I just remembered we had the issue of large(ish) example files also for ReadVTK.jl, where we used a separate repo ReadVTK_examples to hold the example files.