Support wrote: Mon Oct 31, 2022 9:37 am
PrimoCache is a block-level caching program: it caches only disk sectors and has no file information. Journaling is a file-system mechanism, and it would be very difficult for PrimoCache to adopt it.
PS. Here "indexing" refers to PrimoCache's internal indexing database for caching, not file-system indexing.
PS2. We are trying to make L2 safe for defer-write. This will definitely extend the usefulness of defer-write.
I was puzzled to learn that you are not keeping the state of the cache on L2.
What kind of data usage pattern doesn't care about integrity at all?
With SSD, there is little penalty for maintaining integrity, and implementation can be very straightforward.
Tell me, please, where I'm wrong.
1. Reserve sequential space in L2 for every block's destination address and "block flush state" (8 bytes per block?).
2. When you need to write a new block to L2, use a free slot or the slot of the least-recently-used flushed block (keep a sorted index to maintain LRU order).
3. When you get a flush signal for L2 blocks, flush the blocks and update the index in L2 before reporting that the flush is finished.
4. When you flush an L2 block to HDD, update its flush state in L2 asynchronously via a low-priority queue sorted by destination address.
5. When you read from HDD, stop flushing L2 to HDD, write the block to L2, and update the L2 index in an async low-priority queue.
6. On shutdown, flush pending index updates. In case of power loss, you will only do some excess block flushing.
7. On OS start, read the whole index from L2. For a 4TB L2 with 128KB blocks, it's just a 256MB file that reads in ~50ms.
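The steps above can be sketched in a few dozen lines. Everything here is my own illustration, not PrimoCache internals: the 8-byte record layout (56-bit destination block number plus 8-bit state) and the class and function names are assumptions.

```python
import struct
from collections import OrderedDict

BLOCK_SIZE = 128 * 1024        # 128 KB cache blocks, as in the example
L2_SIZE = 4 * 1024**4          # 4 TB L2 cache
N_BLOCKS = L2_SIZE // BLOCK_SIZE

# One 8-byte record per L2 block: a 56-bit destination block number
# plus an 8-bit flush state (this packing is my assumption).
FREE, DIRTY, FLUSHED = 0, 1, 2
REC = struct.Struct("<Q")
MASK = (1 << 56) - 1

def pack(dest_block: int, state: int) -> bytes:
    return REC.pack((state << 56) | (dest_block & MASK))

def unpack(raw: bytes):
    v, = REC.unpack(raw)
    return v & MASK, v >> 56

# Step 7's size check: the whole index for a 4 TB L2 at 128 KB blocks.
index_bytes = N_BLOCKS * REC.size   # 33,554,432 records * 8 B = 256 MB

class L2Index:
    """In-memory mirror of the on-L2 index. Index updates are queued
    and persisted asynchronously (steps 4-5); only a flush signal
    would force synchronous persistence (step 3)."""

    def __init__(self, n_blocks: int):
        self.lru = OrderedDict()            # flushed slots, oldest first
        self.free = list(range(n_blocks))   # never-used slots
        self.pending = []                   # (slot, record) to write to L2

    def allocate(self) -> int:
        # Step 2: prefer a free slot, otherwise evict the LRU flushed block.
        if self.free:
            return self.free.pop()
        slot, _ = self.lru.popitem(last=False)
        return slot

    def write(self, slot: int, dest_block: int) -> None:
        # New dirty data landed in this slot; queue the record for L2.
        self.lru.pop(slot, None)
        self.pending.append((slot, pack(dest_block, DIRTY)))

    def mark_flushed(self, slot: int, dest_block: int) -> None:
        # Step 4: the block reached the HDD; update state asynchronously.
        self.lru[slot] = dest_block
        self.pending.append((slot, pack(dest_block, FLUSHED)))
```

The point of the `pending` queue is that losing it costs nothing but re-flushing a few already-flushed blocks after a crash (step 6), which is exactly why no journaling is needed.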
For full consistency, the destination drive should be set up as a RAW partition over a standard unformatted volume.
L2 should act as a full-size drive and hide all the logic inside the disk driver.
You don't need journaling.
Let me explain a specific use case:
- 2TB (1.5GB/s) CF cards used for RAW video (~1TB/hour)
- 200TB (0.6GB/s) software HDD raid volume for operational storage
- 4TB (5.0GB/s) software raid over M2 SSD drives used for ingress of data from CF and work with recent files
I have to decide when to move data from SSD to HDD and do it manually; the hit/miss ratio of that process is not good.
And if I need to work on moved files again, I have to copy them back to SSD, even though most of the time they are just the recent files.
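Some back-of-the-envelope timing for this workflow, computed from the rates above (decimal units assumed, as drive vendors count them):

```python
TB = 1000**4   # decimal terabyte
GB = 1000**3   # decimal gigabyte

card = 2 * TB            # one full CF card
ingest_rate = 1.5 * GB   # CF card read speed
drain_rate = 0.6 * GB    # HDD raid write speed

ingest_min = card / ingest_rate / 60   # ~22 min to pull a card onto the SSD
drain_min = card / drain_rate / 60     # ~56 min to migrate it on to the HDD
```

A deferred-write L2 could hide that roughly hour-long drain to the HDD raid in the background while the same files stay hot on the SSD for editing.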
With my proposal implemented, I would expect to see the following:
1. 200TB virtual drive where I create and format partitions
2. 200TB RAW partition HDD raid for data
3. 4TB RAW partition SSD raid for the cache
Do you see any inconsistency?
I expect we will still have roughly a 10x cost-per-volume gap between fast and slow storage for the foreseeable future, so two-layer storage will remain important.
Certainly, one can go with an enterprise NAS with an SSD-cache implementation, but for a mid-size setup it's heavy overkill and introduces latency issues.
I can use tiered Storage Spaces, but its write-back cache is not used for read caching, and tier rebalancing is not just-in-time.
And SSD caching there is only available in Windows Server, so there is no solution for the desktop.
Any chance you will consider moving in that direction?
Thanks.