My user account doesn't have sudo despite being in sudoers. I can't run new commands; I have to execute the binary directly. GRUB takes very long to load, hanging at the “Welcome to GRUB” message. I just wanted a stable distro, as Arch broke and corrupted my external SSD.

  • tal@lemmy.today · 1 year ago

    Great. Well, I mean, bad, but that does narrow it down. So that drive is probably failing, but it can read from some places on the drive…just not all. And it fails pretty early, just a few KB into the partition. Though I don’t know why you wouldn’t get a kernel log message about that.

    Well, if we’re really lucky, maybe it just has a bad sector at that one critical location, and everything else is fine. I’m not sure I’d trust a drive once it starts getting read failures, but the point is that other data there might be readable. My understanding – which dates to rotational drives – is that hard drives normally maintain a map of sectors plus a limited store of spare “good” sectors on the drive. When a write fails, the drive switches to a spare sector, mapping the location to it so that, internally, every time you touch that location on the drive, the drive is actually using a different physical location. So even writing to that spot on the disk – though the data there, if it can’t be regenerated, would be lost – may make the location readable again, because the drive will remap the sector to different underlying physical storage.
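
    If the drive supports SMART, you can sometimes watch this remapping happen. A quick check – assuming smartmontools is installed, and that /dev/sdd is the ailing drive – would be:

    # smartctl -a /dev/sdd

    Look for something like Reallocated_Sector_Ct in the attribute list; the exact attribute names vary by vendor, but a nonzero or climbing reallocated count means the drive has been retiring sectors.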

    I understand that SSDs – which are freer to remap sectors than rotational hard drives, where it’s expensive in time to send the head careening around to weird, non-sequential sectors – use something called “wear leveling”: since one can only write so many times to a given spot on an SSD, and the drive doesn’t care about data being physically contiguous, it regularly remaps what’s there to spread the writes around. So if one sector on an SSD starts failing, I’d be a little concerned about others going too.

    So, a couple of things we can maybe experiment with. If we start reading some distance into the drive, we can get some idea of what portion of the partition isn’t readable.

    dd defaults to reading in blocks of 512 bytes at a time. It manages to read 16 512-byte blocks into the partition – 8 KiB of data – and then reading the 17th block is a problem. Maybe try:

    # dd if=/dev/sdd1 status=progress skip=1024 of=/dev/null
    

    That’ll skip over the first 1024 512-byte blocks (that is, start 512 KiB in) and start reading from that point. If the drive can’t read from there, then you’ll get an error, and if it can, then it’ll read for at least a ways.

    If the manual typing isn’t a prohibitive problem with the CP, you can do a binary search for the end of the bad portion. So, we know that block 16 is good and that block 17 is bad. We don’t know what extent of the partition the “bad” region covers – it could be 1 block, could be the rest of the partition, could be an interspersed collection of failing and non-failing sectors. If it’s just one short range, it might be possible to recover what’s there.

    So, I’d start at 1024. If dd can’t read anything 1024 blocks in, then I’d double the skip= parameter to 2048 and try again. At some point, if you keep doubling the number, hopefully you’ll get readable data (ideally the rest of the partition). Once you have a readable point, cut in half the distance between the furthest-known “bad” block (currently 17) and the first-known “good” block. So it’d look something like this, if hypothetically our bad range is blocks 17-1500:

    | Furthest-known “bad” block | First-known “good” block after region | Trying | Result | Notes |
    | --- | --- | --- | --- | --- |
    | N/A | N/A | 0 | Read error after 16 blocks | Our first run |
    | 17 | N/A | 1024 | Read error immediately | Trying with the skip=1024 I suggested above |
    | 17 | N/A | 2048 | No errors | Now we have our first known “good” block after the “bad” portion, at 2048 |
    | 17 | 2048 | 1032 | Error immediately | 1032 is (17+2048)/2 |
    | 1032 | 2048 | 1540 | No errors | 1540 is (1032+2048)/2 |
    | 1032 | 1540 | 1286 | Error immediately | 1286 is (1032+1540)/2 |

    The commands there would be something like:

    # dd if=/dev/sdd1 status=progress of=/dev/null
    # dd if=/dev/sdd1 status=progress skip=1024 of=/dev/null
    # dd if=/dev/sdd1 status=progress skip=2048 of=/dev/null
    # dd if=/dev/sdd1 status=progress skip=1032 of=/dev/null
    

    …etc. At some point, the first two numbers, the furthest-known “bad” block and the first-known “good” block, will converge until they’re adjacent – which for our hypothetical example would be blocks 1500 and 1501 – and then we know where the “bad” region ends (assuming that it is a contiguous bad region…otherwise we might skip over some good data).
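
    If the manual typing gets old, the same search can be scripted. Here’s an untested sketch, run as root, assuming the bad region is contiguous and seeding the bounds with the hypothetical numbers from the table above. Note that skip= counts blocks to pass over, so there may be an off-by-one relative to the table’s block numbers, which doesn’t matter much for finding the approximate extent; dd exiting nonzero on a read error is what drives the search:

    #!/bin/sh
    # Binary search for where the bad region ends, assuming it is contiguous.
    bad=17      # furthest block known to be bad (hypothetical, from the table)
    good=2048   # first block known to be good after it (hypothetical, from the table)
    while [ $((good - bad)) -gt 1 ]; do
        mid=$(( (bad + good) / 2 ))
        # Read a single 512-byte block at the midpoint; dd exits nonzero on a read error.
        if dd if=/dev/sdd1 of=/dev/null skip="$mid" count=1 2>/dev/null; then
            good=$mid    # readable: the bad region ends at or before the midpoint
        else
            bad=$mid     # unreadable: the bad region extends at least this far
        fi
    done
    echo "furthest bad block: $bad; first good block: $good"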

    I’d at least try a couple commands to get an idea of whether the whole disk is hosed or just a tiny portion at the start. If a lot of it isn’t readable and can’t be made to be readable, then it’s going to be tough to recover. If it’s a tiny amount of data at the beginning of the drive, that might not be so bad.

    Maybe only try to copy a limited number of blocks each time – for 5 MiB, that’d be count=10240 – so something like:

    # dd if=/dev/sdd1 status=progress count=10240 skip=1024 of=/dev/null
    

    Then you don’t have to whack Control-C to cancel it if most of the drive is “good” data.

    If there isn’t a whole lot of “bad” data, an option to try to pull all accessible data off the drive might be ddrescue. In Debian, this is in the gddrescue package. It will attempt to read from a block device, like your /dev/sdd1 partition, and write what it can read to another file or device. It’ll retry places where it gets a read error, log where the errors are in a “mapfile”, and then move on, trying to extract as much data as possible from a device that is seeing hardware failures. It’s possible to try that. Unfortunately, I don’t have a device that spits out read errors handy to try it out on, so I can only give you commands from the man page; I can’t test them here. I also haven’t used it myself to recover data from a drive, since I haven’t run into your “some of the drive is readable, some isn’t” scenario. I believe it used to be more popular in the burned-CD era, when similar problems would sometimes show up.
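
    Going from the man page – so, untested, and the image and mapfile paths here are just placeholders – a common approach is two passes: a first pass with -n that grabs everything easily readable while recording the failed areas in the mapfile, then a second pass with -d (direct disc access) and -r3 (retry each bad area 3 times) that goes back over just the failed spots:

    # ddrescue -n /dev/sdd1 /mnt/bigdrive/sdd1.img /mnt/bigdrive/sdd1.mapfile
    # ddrescue -d -r3 /dev/sdd1 /mnt/bigdrive/sdd1.img /mnt/bigdrive/sdd1.mapfile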

    You will also want a second, larger drive to store the output from ddrescue on. I don’t know whether reads exacerbate problems for the SSD, but for all I know the drive might go belly-up as a whole at some point, and reads might be an input into that – so if the aim is to grab what can be grabbed from the drive, it’d be a good idea not to run over it a huge number of times.
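
    To see how much space the image will need, you can ask for the partition’s size in bytes – blockdev is part of util-linux, so it should already be there:

    # blockdev --getsize64 /dev/sdd1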

    Another option would be to try to do the recovery directly on the problematic drive – like, if only a small area is bad, it might be possible to write zeros or something over the bad range, hopefully making the area readable again, and hope that nothing in the bad region is critical for e2fsck to do the repair. If it’s worth it to you to get another drive to dump this onto first, though, and the existing drive doesn’t have too much “bad” data, I’d probably do that and then try to repair the filesystem on the copy, as that would be less intrusive to this drive, which I’d be inclined not to trust a whole lot. Worst case, it isn’t repairable, and then one has a new drive to store a new collection, I suppose.
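
    For that last-resort, in-place route, the sketch would be something like the following – untested, using the hypothetical 17-1500 bad range from the table, where seek=16 skips 16 blocks so writing starts at block 17, and count=1484 covers blocks 17 through 1500. Triple-check the arithmetic before running: this destroys whatever was stored in that range, and an off-by-one destroys good data too:

    # dd if=/dev/zero of=/dev/sdd1 seek=16 count=1484
    # e2fsck -f /dev/sdd1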