
I have had many SD cards become corrupted on my Raspberry Pis. In most cases I detected it by running fsck on the SD card partitions, which I did because the system had become unstable.

In some cases I tried to repair the volume with fsck or other tools, but the fix didn't last long. As someone said:

fsck definitely causes more harm than good if the underlying hardware is somehow damaged; bad CPU, bad RAM, a dying hard drive, disk controller gone bad... in those cases more corruption is inevitable.

SD card corruption has two main causes:

  1. abrupt power loss while writes to the SD card are in progress
  2. too many write operations on the flash memory

In my case, 2. is the cause (the system is always powered off cleanly).

Now I'm looking for a way to detect or anticipate a dying SD card so that I can take appropriate action, such as sending an automatic mail alert or backing up partitions. The system is running, so I cannot remove the SD card and check it externally.

In this post I'm not looking for ways to prevent SD card corruption; I just want to detect when my SD card is dying.

I thought about:

  • rebooting and running fsck during startup, then checking the result later and taking appropriate action if errors are detected. But how can I get the fsck results? I'm running an Arch distro, here is fsck info for arch
  • running fsck periodically on the mounted root partition (without any reboot). How can I do it cleanly?

Do you have other suggestions?

rem
  • Wow - I've got about 20 Pis and have never had a corrupt SD card. I always use SanDisk cards. What power supply are you using? Do you ever get the lightning symbol? – CoderMike Oct 11 '17 at 08:02
  • I'm using a Raspberry Pi 3, the official Raspberry Pi 3 power supply, and SanDisk class 10 SD cards. But as I said above, this post is not about preventing SD card corruption. In any case, SD cards have a limited number of write cycles; depending on how much your system writes, you may never reach this limit. In this post, I just want to automatically detect when this limit is reached on a system running from an SD card. – rem Oct 11 '17 at 08:06
  • Are you using Raspbian Stretch? – CoderMike Oct 11 '17 at 08:27

3 Answers

  1. You provide no evidence that the "SD Card is dying". While I and a few others have had very rare card failures (usually early-life failures, replaced under warranty), most people use their cards quite hard with no problem.

  2. Image/partition problems are also uncommon, but do happen. After repairing or restoring the image, the cards work with no problem. Again, most experienced users never have any problem, even when power failures happen.

In fact the SD Card firmware hides most problems from the OS, so it is not possible to determine the card's condition without specialised software that has access to proprietary data.

SD Cards DO NOT have a limit on write cycles; individual cells DO. In practice this means blocks of the Erase Block Size (usually 4M) can fail. Cards have "spares" which the firmware manages, and good firmware uses wear levelling to ensure that bad cells are retired and that all cells experience equivalent use.

Milliways
  • I'm not sure the SD card is dying, but I am sure about the fsck result. If I get errors, something went wrong... On a system that has always been shut off cleanly, I assumed the root cause was the limited number of write cycles per cell (many services are logging, some InfluxDB databases are modified...). Now, I want to automatically detect when errors are found on the filesystem. – rem Oct 11 '17 at 09:21
  • @rem The FS (usually ext4) uses journalling to "correct" errors. This is independent of the storage medium. Your assumption is just that; the OS has NO KNOWLEDGE of the nature of the medium, the SD Card firmware maps between sectors (probably blocks) and storage. There is some evidence that different firmware exhibit different performance. – Milliways Oct 11 '17 at 09:32

As someone said: "fsck definitely causes more harm than good if the underlying hardware is somehow damaged; bad CPU, bad RAM, a dying hard drive, disk controller gone bad... in those cases more corruption is inevitable."

Not every "someone's" opinion online is worth the pixels it is printed with. While fsck obviously can't repair damaged hardware, the idea that it will do "more harm than good" is paranoid to the point of delusion. Is it conceivable? Yes, but in that case anything you do with the card/storage including powering it on could create additional corruption, because e.g., bad RAM creates the potential for anything to happen. By far the most likely thing, however, is simply that executables will fail randomly. If this is fsck in the middle of a repair, that will mean the repair is not completed, not that more damage is done.

Fsck is a tool for repairing corrupted filesystems. If you think it is unsafe to use for this purpose, then you should turn off your computer now and leave it off, because as long as it is okay everything is fine, but if something goes wrong you may not be able to use it.

I've had more spinning disks (i.e., traditional hard drives) eventually fail on me than SD cards, but I have had both. This is to be expected, although in theory arguably less so with SD cards (a quality card could, in theory, last through decades of normal use¹).

In any case, when secondary storage has failed due to hardware issues, fsck will probably fail. It probably will not come back and tell you it fixed errors it cannot fix. This is what I've observed with hard drives. With the one SD card I've had go bad, what happened is it did appear to make some corrections -- but I like to run it twice when that happens, just to be sure. In this case, it simply keeps saying the filesystem needs repair, and if you allow it to make repairs it may seem to do so, but if you check again immediately it will do something similar again. If this happens, consider your card defunct.
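If you want to automate the "run it twice" check described above, here is a minimal sketch in Python. It relies on the exit status bits documented in the fsck(8) manual page; the function names are made up for illustration, and how you obtain the two status values (for example from the `returncode` of `subprocess.run` on an fsck invocation) is left as an assumption.

```python
# Exit status bits as documented in fsck(8); the status is a bitwise OR.
FSCK_EXIT_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Return human-readable descriptions for an fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in FSCK_EXIT_BITS.items() if status & bit]


def card_looks_defunct(first_status, second_status):
    """Hypothetical helper for the 'run it twice' heuristic.

    Returns True when both the first pass and the immediate second pass
    report filesystem errors (corrected or left uncorrected).
    """
    error_bits = 1 | 4
    return bool(first_status & error_bits) and bool(second_status & error_bits)
```

A status of 5, for instance, decodes to both "errors corrected" and "errors left uncorrected". A script could send the mail alert when `card_looks_defunct` returns True.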

Another clue on Linux, if the failure is serious enough (with hard disks the problem usually grows over a period of days or weeks; I think SD cards at the end of their natural lifespan will do the same), is that processes will start falling prey to uninterruptible sleep. This is a situation caused by a chicken and egg problem with secondary storage I/O. If these are critical processes, the system can freeze up periodically.

Fortunately, this is easy to spot using process diagnostic tools such as top or ps; such processes will be in a D state. Furthermore, there will be copious messages in the system log. Again, at this point consider the card defunct. Hard drives can be revived a bit by tagging "bad blocks" (if they are generally wearing out though, more blocks will soon follow), but the OS cannot do this effectively for SD cards.
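The same D-state check can be scripted instead of watching top by hand. The following is a rough sketch in Python that reads the state field from /proc/<pid>/stat, as described in proc(5); the function names are my own, and it has not been tuned for any particular system.

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx])
    command = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, command, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, command) for every process currently in the D state."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable
        if state == "D":
            blocked.append((pid, command))
    return blocked
```

Calling `find_uninterruptible()` from a cron job or timer would give a list to alert on. Processes pass through the D state briefly during normal I/O, so it probably makes sense to alert only when the same PID shows up across several consecutive samples.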


1. That's based purely on the lifetime of the flash memory; I think I have read somewhere that if frequently inserted and removed the contacts can wear out before anything else.

goldilocks

I try to use a USB HDD wherever I run SQL or another write-intensive application. Here is a generic approach suitable for most cases:

1) Make 3 directories (/home, /srv, /var) on the USB HDD and copy everything from the corresponding SD card directories into them.

2) Mount the USB HDD directories over the originals by editing fstab and adding these lines:

/mnt/usb_hdd/home    /home    none defaults,bind 0 0
/mnt/usb_hdd/srv    /srv    none defaults,bind 0 0
/mnt/usb_hdd/var    /var    none defaults,bind 0 0

/run should already be mounted as tmpfs, so leave it be.

This way, all write-intensive activity is transferred to the USB HDD, sparing the SD card from wear.
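To confirm after a reboot that the three bind mounts are actually active, something like the following Python sketch could be used. It assumes the usual whitespace-separated layout of /proc/self/mounts (device, mount point, type, options, ...); the function names are hypothetical.

```python
def mounted_points(mounts_text):
    """Collect the mount-point column from /proc/self/mounts style text."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])  # second field is the mount point
    return points


def missing_bind_mounts(mounts_text, expected=("/home", "/srv", "/var")):
    """Return the expected directories that are not currently mount points."""
    active = mounted_points(mounts_text)
    return [path for path in expected if path not in active]
```

On the Pi you would pass in the contents of /proc/self/mounts, e.g. `missing_bind_mounts(open("/proc/self/mounts").read())`; an empty list means all three directories are mounted. Note this only shows that something is mounted there, not that it is the USB HDD.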

Fabian