[BUG] Interrupted mobile uploads leave corrupt files #9964

Open
2 of 3 tasks
SixFive7 opened this issue Jun 3, 2024 · 12 comments

Comments

@SixFive7

SixFive7 commented Jun 3, 2024

The bug

Every few thousand uploads, something goes wrong and an upload stops mid-file. This causes errors during file processing, and the affected uploads end up stuck in the untracked files section of the repair tab.

There are a few issues with this:

  1. Apparently partial uploads are not detected by the upload code. Given that some partial uploads might be able to get processed as if they were a correct file this means there is no way to know if a part of the library is corrupt.
  2. The files that are partial enough (not all!) to get stuck in the untracked files sections are stuck there. There is some (older?) documentation referring to a "Remove Offline Files" job that does not seem to exist for the default user library?
  3. Even when the server knows some files can't be processed (if you are lucky; sometimes it thinks the partial files are just fine), the mobile app incorrectly shows that everything is fine.

The most egregious issue to me seems to be issue number 1, especially for broken uploads that don't get detected as corrupt. A relatively easy fix would be for the app to upload not only the file but also its checksum, and for the server to accept the file only if the checksum checks out. If not, drop the upload and ask the app to try again. Or even simpler: upload to an example.jpg.partial file that is ignored by the server, and rename it to example.jpg when the upload is done.
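To make the two ideas above concrete, here is a minimal Python sketch that combines them: the incoming bytes are staged in a `.partial` file, hashed while they are written, and renamed to the final name only if the hash matches what the client sent. This is not Immich's actual upload code. The function name `receive_upload`, the `ChecksumMismatch` exception, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import os
from pathlib import Path
from typing import BinaryIO


class ChecksumMismatch(Exception):
    """Raised when the received bytes do not match the client's checksum."""


def receive_upload(stream: BinaryIO, dest: Path, expected_sha256: str,
                   chunk_size: int = 1 << 20) -> Path:
    """Stage an upload as '<name>.partial' and publish it only if it verifies.

    Hypothetical helper: the name, signature, and hash algorithm are
    illustrative assumptions, not part of any real server API.
    """
    partial = dest.with_name(dest.name + ".partial")
    digest = hashlib.sha256()
    try:
        with open(partial, "wb") as out:
            # Hash the data while writing it, so no second read pass is needed.
            while chunk := stream.read(chunk_size):
                digest.update(chunk)
                out.write(chunk)
            out.flush()
            os.fsync(out.fileno())
        if digest.hexdigest() != expected_sha256.lower():
            raise ChecksumMismatch(f"{dest.name}: checksum mismatch, client should retry")
        # Same-directory rename: the final name only ever points at a complete file.
        os.replace(partial, dest)
        return dest
    except BaseException:
        # Drop the staged file on any failure; the final name is never created.
        partial.unlink(missing_ok=True)
        raise
```

With this shape, a connection that dies mid-transfer leaves at most a `.partial` file, which a scanner can ignore or clean up, and a truncated body that somehow arrives "successfully" is rejected by the checksum comparison.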

Update: Seems @ItalyPaleAle already ran into this wall once before. Not sure why #4532 was closed.

The OS that Immich Server is running on

Unraid v6.12.10

Version of Immich Server

v1.105.1

Version of Immich Mobile App

v1.105.0

Platform with the issue

  • Server
  • Web
  • Mobile

Your docker-compose.yml content

# backup stack

networks:
  default:
    name: backup
services:

  immich:
    container_name: backup-immich
    image: ghcr.io/immich-app/immich-server:release
    command: ['start.sh', 'immich']
    restart: always
    depends_on:
      - redis-immich
      - postgres-immich
    volumes:
      - /mnt/user/immich:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    environment:
      TZ: Europe/Amsterdam
      PUID: 99
      PGID: 100
      REDIS_HOSTNAME: redis-immich
      DB_HOSTNAME: postgres-immich
      DB_DATABASE_NAME: immich
      DB_USERNAME: postgres
      DB_PASSWORD: REDACTED
    labels:
      traefik.enable: true
      traefik.http.services.immich-backup.loadbalancer.server.port: 3001

  microservices-immich:
    container_name: backup-immich-microservices
    image: ghcr.io/immich-app/immich-server:release
    command: ['start.sh', 'microservices']
    restart: always
    depends_on:
      - redis-immich
      - postgres-immich
    volumes:
      - /mnt/user/immich:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["GPU-e1474488-7fa9-85a5-803b-59c645d71e0d"]
              capabilities: [gpu]
    environment:
      TZ: Europe/Amsterdam
      PUID: 99
      PGID: 100
      NVIDIA_DRIVER_CAPABILITIES: all
      NVIDIA_VISIBLE_DEVICES: all
      REDIS_HOSTNAME: redis-immich
      DB_HOSTNAME: postgres-immich
      DB_DATABASE_NAME: immich
      DB_USERNAME: postgres
      DB_PASSWORD: REDACTED

  machinelearning-immich:
    container_name: backup-immich-machinelearning
    image: ghcr.io/immich-app/immich-machine-learning:release-cuda
    restart: always
    volumes:
      - /mnt/user/cache/backup/immich/machinelearning:/cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["GPU-e1474488-7fa9-85a5-803b-59c645d71e0d"]
              capabilities: [gpu]
    environment:
      TZ: Europe/Amsterdam
      PUID: 99
      PGID: 100
      NVIDIA_DRIVER_CAPABILITIES: all
      NVIDIA_VISIBLE_DEVICES: all

  redis-immich:
    container_name: backup-immich-redis
    image: registry.hub.docker.com/library/redis:6.2-alpine@sha256:84882e87b54734154586e5f8abd4dce69fe7311315e2fc6d67c29614c8de2672
    restart: always
    volumes:
      - /mnt/user/cache/backup/immich/redis:/data
    environment:
      TZ: Europe/Amsterdam
      PUID: 99
      PGID: 100

  postgres-immich:
    container_name: backup-immich-postgres
    image: registry.hub.docker.com/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    command: ["postgres", "-c" ,"shared_preload_libraries=vectors.so", "-c", 'search_path="$$user", public, vectors', "-c", "logging_collector=on", "-c", "max_wal_size=2GB", "-c", "shared_buffers=512MB", "-c", "wal_compression=on"]
    restart: always
    volumes:
      - /mnt/user/containers/backup/immich-postgres:/var/lib/postgresql/data
    environment:
      TZ: Europe/Amsterdam
      PUID: 99
      PGID: 100
      POSTGRES_INITDB_ARGS: '--data-checksums'
      POSTGRES_DB: immich
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: REDACTED

Your .env content

In-lined into the compose file.

Reproduction steps

Attempt mobile uploads of a large number of files and mess with the connection by:

- Rapidly switching the app between foreground and background during the upload.
- Switching between WiFi and mobile networks.
- Killing and restarting the app a few times.

Then check the server's repair page. Some files will be listed there.

Relevant log output

Can't find the relevant log anymore. But it was something generic about not being able to read the file. This is expected as the file has only been partially uploaded.

Additional information

This is an (intentionally very low-res) screenshot comparing the two files: the uploaded file on the left, the original file on the right. You can clearly see how the uploaded file's size, about 10% of the original, is reflected in the image.
image

@SixFive7
Author

SixFive7 commented Jun 3, 2024

Related as these three PRs seem to already contain 80% of the required code:
#9306
#2072
#7135

@nomandera

This seems like a direct fit for the issue I am seeing and mentioned elsewhere.

I post specifically because the OP mentions "Every few thousand uploads something goes wrong", but this can be made much worse under certain conditions.

In my case, after returning from a family vacation to an area of the world with spotty, slow internet and daily power cuts, I ended up with 0.5TB of untracked files (mainly video) and a very real worry that I do not have a proper backup of my vacation photos and videos.

Unless I am misinterpreting this issue, it could arguably be classed as "potential data loss", and if so this is as serious as it gets. Hopefully I am wrong about this.

@alextran1502
Contributor

@nomandera it is correct that you are misinterpreting this. An interrupted upload doesn't send the complete event to the server, so the file will be re-uploaded.

@Snuupy

Snuupy commented Aug 6, 2024

@alextran1502

> @nomandera it is correct that you are misinterpreting this. An interrupted upload doesn't send the complete event to the server, so the file will be re-uploaded.

No, interrupted uploads lead to files that never upload completely. Each attempt stops very early on (a few MB are uploaded, then it errors out).

The app then tries to re-upload indefinitely, but it always fails with this error:

Immich
Backup error
Failed to backup assets. Retrying...

If you try to do a manual backup, it will still fail:
Screenshot_20240805-220319

@alextran1502
Contributor

@Snuupy that error seems to be from your reverse proxy; try connecting via the local IP.

@Snuupy

Snuupy commented Aug 6, 2024

@alextran1502 you're right, I connected on local port and it succeeded. I am now confused as to why the reverse proxy broke. Here is the swag config:

server {
    listen 443 ssl;
    listen [::]:443 ssl;

    server_name immich.*;

    include /config/nginx/ssl.conf;

    client_max_body_size 50000M;

    access_log off;

    location / { # web
        include /config/nginx/resolver.conf;

        proxy_buffering off;

        proxy_http_version 1.1; 
        proxy_set_header Host $host; 
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; 
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_redirect off;
        
        # set timeout
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;
        send_timeout       600s;
        
        set $upstream_app immich-server;
        set $upstream_port 3001;
        set $upstream_proto http;
        proxy_pass $upstream_proto://$upstream_app:$upstream_port;
    }

}

I will check the nginx proxy configs to see if I can find out why it's broken. Thanks.

@nomandera

> @nomandera it is correct that you are misinterpreting this. An interrupted upload doesn't send the complete event to the server, so the file will be re-uploaded.

@alextran1502 excellent, and thank you for the fast clarification. I am really glad I asked, as it wasn't obvious to me from reading all the discussion posts and from my symptoms that this would be the case, and I suspect I wasn't alone in this interpretation. I will delete my 500GB of noise now, confident that it is just leftovers and non-unique.

Final question to close this off in my mind.

Does it follow, then, that if the leftover files do not reoccur after cleanup, we can say categorically that the client has since backed them up successfully? Essentially, I am just looking for a way to be confident that all my family's clients are backed up, i.e. if I see new client images and no noise, I can conclude they are fully backed up.

@SixFive7
Author

SixFive7 commented Aug 6, 2024

Reverse proxy issues aside, I still have the bug mentioned in this issue where backup data is corrupted while everything is showing green. I can even still reliably reproduce it.

@nomandera

nomandera commented Aug 7, 2024

This seems to directly contradict my assumptions, so now I am even more confused. The volume of media is such that I can't manually spot-check this in any meaningful way, and I am worried.

Update:

I believe I located my prime offender: my youngest kid's phone seems to struggle to upload videos (it's not the best phone). This was pretty impactful for me because I ended up with 700GB of stuck corrupt videos, which was enough to fill a drive and cause havoc, with Docker dropping the whole server. If preventing these files from accumulating is complex to solve or far off, can I suggest that at the very least some sort of space check is added?
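As a rough illustration of the kind of space check suggested here, the following Python sketch refuses an upload when accepting it would leave less than a reserve of free space on the target volume. The function name `has_room_for` and the 5 GiB default reserve are hypothetical choices for this example, not existing Immich settings.

```python
import shutil
from pathlib import Path


def has_room_for(upload_dir: Path, incoming_bytes: int,
                 reserve_bytes: int = 5 * 1024**3) -> bool:
    """Return True if the upload fits while leaving reserve_bytes free.

    Hypothetical helper: the name and the 5 GiB default are assumptions.
    """
    free = shutil.disk_usage(upload_dir).free
    return free - incoming_bytes >= reserve_bytes
```

A server could call this with the declared size of the incoming file before it starts writing and reject the request early, rather than filling the drive with partial files.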

Update 2: This issue has not reoccurred for me since I "fixed" the child's phone.

@Torqu3Wr3nch

Torqu3Wr3nch commented Sep 30, 2024

Just checking in on this. Does anyone know what the current state of this issue is?

I was very much looking forward to deploying Immich, but any question of data corruption (and even worse, undetected data corruption) on a photo backup solution is an absolute show-stopper. Echoing @nomandera's comments, I can't imagine a worse failure mode.

Does Immich do any kind of checksumming/file verification?

For reference, this is how Nextcloud does it: https://help.nextcloud.com/t/does-the-nextcloud-client-add-checksum-verifications-when-uploading/193040/2

Thank you in advance.

@SixFive7
Author

The situation remains unchanged for me; the bug still occurs. It is easily testable as well. I simply sync everything with Immich and then compare the checksums of the folder on the server with the same folder synced by Syncthing. If I kill the app a few times, there are always some partially uploaded files.
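The comparison described above could be scripted roughly as follows. This is a minimal Python sketch, not the exact procedure used in this thread: it hashes every file in a reference folder (for example the Syncthing copy) and reports those whose exact content does not appear anywhere in the backup folder. Matching by content hash rather than by file name is an assumption made here so that files renamed or moved on the server still count as present. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def missing_from_backup(reference_dir: Path, backup_dir: Path) -> list[Path]:
    """Return reference files whose exact content is absent from the backup."""
    backup_hashes = {
        file_sha256(p) for p in backup_dir.rglob("*") if p.is_file()
    }
    return sorted(
        p for p in reference_dir.rglob("*")
        if p.is_file() and file_sha256(p) not in backup_hashes
    )
```

A truncated upload hashes differently from its original, so the original shows up in the returned list even when a file of the same name exists on the server.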

I also wholeheartedly agree that silent data corruption is a worst-case scenario bug. It is the main, if not the only, blocker for me as well. That said, I respect the amount of work that goes into a project like Immich, so I will respectfully abide until a developer has time to fix this. When that happens, I can at least provide some thorough testing.

@Torqu3Wr3nch

Oh certainly; I too am appreciative of @alextran1502's efforts (in case it's not clear, thank you, Alex). I'm only disappointed because I'm looking forward to using the app but with this kind of a bug, I simply cannot use it yet.

I was actually hoping you would respond, @SixFive7. As I reread your initial post/responses, it seems like the errors/corrupted uploads are in fact detected: you do see all affected files in the untracked errors section of the repair tab, correct?

> Apparently partial uploads are not detected by the upload code. Given that some partial uploads might be able to get processed as if they were a correct file this means there is no way to know if a part of the library is corrupt.

The partial uploads are detected though, aren't they? In the sense that you can find them in the repair tab, right? So you could use this to determine if part of the library is corrupt, correct?

> The files that are partial enough (not all!) to get stuck in the untracked files sections are stuck there. There is some (older?) documentation referring to a "Remove Offline Files" job that does not seem to exist for the default user library?

So not all partial/corrupted files show up in the untracked section? This is the scenario I am most worried about.

Like I said, I haven't yet deployed Immich, I just don't want to end up in a situation where I think my data is protected but it is not.

Thank you again everyone for your responses in advance!

7 participants