L2 cache corrupted 0.99 Topic is solved

Yozzer
Level 4
Posts: 34
Joined: Sun Aug 25, 2013 11:12 am

L2 cache corrupted 0.99

Post by Yozzer »

0.99 has been working fine for me, at least up to a week ago. I play a strategy game called Warband a few hours each evening, but a week ago I started to get random crashes to desktop. I tried rolling back the video drivers (the only change in recent weeks), but it continued crashing.
I had two drives cached, with 6 GB of RAM for L1 and 12 GB for L2, using a 24 GB partition on a Samsung 840 Pro for the L2 split. Both had been running OK for weeks since the update from 0.98, but I disabled both to see if it helped.
There have been no crashes now for two evenings (it was happening 2 or 3 times a session over the previous 4 evenings), so I suspect the L2 cache had somehow become corrupted. I did have write defer enabled, at 10 seconds, on the data drive only.
The system drive (1 TB) contained the program and the data drive (3 TB) the save file, but as it has worked OK since I disabled the cache, I think the saves were OK. I will set up the cache again this evening and test again. It may have been a one-off, but I will not update or make any changes, and I will see how it runs and log any other unusual events.
Is there a way I can check the L2 cache for integrity should this happen again?
Support
Support Team
Posts: 3627
Joined: Sun Dec 21, 2008 2:42 am

Re: L2 cache corrupted 0.99

Post by Support »

Yozzer wrote: Is there a way I can check the L2 cache for integrity should this happen again?
I'm sorry, but so far there is no way to check the L2 cache integrity. You can simply reset the cache content if you suspect it is out of sync.
Yozzer
Level 4
Posts: 34
Joined: Sun Aug 25, 2013 11:12 am

Re: L2 cache corrupted 0.99

Post by Yozzer »

I reset both caches and reconfigured them OK, but left write defer off for the moment. I do not think that was the problem, but I will see how it goes for a few days. I certainly missed having PrimoCache running; it is getting better with each release.
Bjameson
Level 6
Posts: 62
Joined: Mon Nov 08, 2010 12:00 pm

Re: L2 cache corrupted 0.99

Post by Bjameson »

No way to check L2's integrity? No checksum over the blocks stored? Can't be true. If it is true, stop using L2 immediately, because L2 can be corrupted by other processes, either in memory or on disk. Do not rely on Windows or CPU locking mechanisms to protect L2 from (inadvertent) tampering. "Simply resetting the cache content" cannot be serious advice to a commercial customer.

It doesn't matter how much rewriting it takes - if you want to sell this product for commercial use, you must be able to detect memory/disk/controller faults. You could even detect cable faults if you want to. Interrogate the disk controller for errors and the SMART values for excessive retries and write a checksum for each block or group of blocks.
piquadrat
Level 4
Posts: 26
Joined: Wed Jan 22, 2014 7:41 am

Re: L2 cache corrupted 0.99

Post by piquadrat »

Redundant arrays (RAID) include verification routines only as an optional maintenance procedure. Verification is not built into the storage system per se; the performance penalty is too high. Besides, if an inconsistency occurs, how can one tell which source is corrupted (cache or drive)?
Bjameson
Level 6
Posts: 62
Joined: Mon Nov 08, 2010 12:00 pm

Re: L2 cache corrupted 0.99

Post by Bjameson »

The idea is to prevent users from pointing fingers at Romex.

The cache needs to make sure there were no read errors in the first place. So you watch for any errors reported by the disk controller. If SMART reports excessive retries, then you know something -may- be wrong with the disk system. Being the cache, your safety precaution should be to shut yourself down and log the action in the Windows Event Log, or show it in the PrimoCache GUI.

On top of that the cache should keep a checksum over the L2 data stored. When the cache reads back the same data later and finds a checksum mismatch, it knows that "something" has caused corruption.
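To make the checksum idea concrete, here is a minimal sketch in Python. It is not how PrimoCache works internally; the `ChecksummedCache` class and its `put`, `get` and `corrupt` methods are hypothetical names made up for illustration, and CRC32 is just one possible checksum. Each block is stored together with a checksum, and a read that finds a mismatch drops the entry instead of returning it.

```python
import zlib


class ChecksummedCache:
    """Toy stand-in for an L2 cache: each block is stored with a CRC32."""

    def __init__(self):
        self._blocks = {}  # block number -> (data, crc32 of data)

    def put(self, block_no, data):
        # Record the checksum at the moment the block enters the cache.
        self._blocks[block_no] = (bytes(data), zlib.crc32(data))

    def get(self, block_no):
        """Return the block, or None on a miss or a checksum mismatch."""
        entry = self._blocks.get(block_no)
        if entry is None:
            return None
        data, crc = entry
        if zlib.crc32(data) != crc:
            # "Something" changed the stored bytes: drop the entry so the
            # caller falls back to the source disk.
            del self._blocks[block_no]
            return None
        return data

    def corrupt(self, block_no):
        """Demo helper only: flip one bit of a stored block."""
        data, crc = self._blocks[block_no]
        self._blocks[block_no] = (bytes([data[0] ^ 0x01]) + data[1:], crc)


cache = ChecksummedCache()
cache.put(7, b"savegame block")
print(cache.get(7))   # the block comes back intact
cache.corrupt(7)
print(cache.get(7))   # None: the mismatch is caught and the entry dropped
```

A check like this only catches bytes that changed after they were written to the cache. It says nothing about a block that is intact but no longer matches the source disk.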

In all cases the cache should shut itself down at the first sign of trouble.
This not only protects data but it also protects Romex against lawsuits over damages caused by corruption because Romex can prove it has taken reasonable precautions.

... Just trying to think constructively.
Support
Support Team
Posts: 3627
Joined: Sun Dec 21, 2008 2:42 am

Re: L2 cache corrupted 0.99

Post by Support »

@Bjameson,

The issue is that, to make sure the L2 cache is not out of sync, the data from L2 storage has to be compared with the data from the source disk. As piquadrat said, this process would carry too high a performance penalty, because you have to read the source disk.
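A rough sketch of why that comparison is expensive, again in Python and with made-up names (`CountingDisk` and `verify_cache` are hypothetical, not part of any real product): checking whether cached blocks are in sync costs one read of the source disk per cached block, which is exactly the work the cache exists to avoid.

```python
class CountingDisk:
    """Hypothetical source disk that counts how many block reads it serves."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0

    def read(self, block_no):
        self.reads += 1
        return self.blocks[block_no]


def verify_cache(cache, disk):
    """Compare every cached block with the source; return mismatched block numbers."""
    mismatched = []
    for block_no, cached in cache.items():
        if cached != disk.read(block_no):  # one source read per cached block
            mismatched.append(block_no)
    return mismatched


disk = CountingDisk({0: b"aaaa", 1: b"bbbb", 2: b"cccc"})
cache = {0: b"aaaa", 1: b"bbbX", 2: b"cccc"}  # block 1 differs from the source

bad = verify_cache(cache, disk)
print(bad, disk.reads)  # [1] 3
```

Note that the result only says block 1 differs. On its own it cannot say whether the cached copy or the copy on the source disk is the wrong one.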
Bjameson
Level 6
Posts: 62
Joined: Mon Nov 08, 2010 12:00 pm

Re: L2 cache corrupted 0.99

Post by Bjameson »

@Support: Thank you for your explanation. It took me some time to figure this out but now I finally understand.
Indeed, you can never tell for sure exactly where the corruption originated.
This also means that all other caches have the same problem and there's really nothing anyone can do about it.

Thanks, keep up the good work!
Yozzer
Level 4
Posts: 34
Joined: Sun Aug 25, 2013 11:12 am

Re: L2 cache corrupted 0.99

Post by Yozzer »

The posts here have been very interesting, but with regard to knowing the cause of the corruption: if a system runs OK without a cache, then an L1 cache is introduced and it still runs OK, then an L2 cache is introduced and it fails multiple times until the L2 cache is removed, would that not be proof that the L2 cache is causing a problem, or maybe even that the SSD is at fault?
That is the methodology I have adopted to narrow down the source of the problem, so input would be appreciated in case my logic is flawed.
I did follow this approach, and it is still failing within 30 minutes to 1 hour of reintroducing the L2 cache, playing the same game. I wondered if for some reason the L2 cache was not flushing properly, leaving corrupted data in it that was still being used.
Anyway, I am currently running clean with just the L1 cache, and plan to use a different SSD for L2 tomorrow, though the SSD I used first tests fine. If it works OK with the new SSD, I will delete the partition on the original SSD, completely reformat it and try again. All very odd, as it had been fine for some time until now. I will post back with my findings, or disable the L2 cache each time I run the game (kidding).
Support
Support Team
Posts: 3627
Joined: Sun Dec 21, 2008 2:42 am

Re: L2 cache corrupted 0.99

Post by Support »

@Yozzer,

Thanks. Of course your logic is correct. I'm looking forward to your findings.