diff options
author | Mauro Carvalho Chehab <mchehab+samsung@kernel.org> | 2019-06-18 16:40:16 -0300 |
---|---|---|
committer | Mauro Carvalho Chehab <mchehab+samsung@kernel.org> | 2019-07-15 09:20:28 -0300 |
commit | 43f6c0787c1781b951d686e8302377fcf85ccb8a (patch) | |
tree | 91ccc899245bb8470f82bcfef80cd641125e46ad /Documentation/mtd | |
parent | bf6b7a742e3f82b3132e149fb17761e84207f9f1 (diff) | |
download | lwn-43f6c0787c1781b951d686e8302377fcf85ccb8a.tar.gz lwn-43f6c0787c1781b951d686e8302377fcf85ccb8a.zip |
docs: mtd: move it to the driver-api book
While I was tempted to move it to admin-guide, as some docs
there are more userspace-faced, there are some very technical
discussions about memory error correction code from the Kernel
implementer's PoV. So, let's place it inside the driver-api
book.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Diffstat (limited to 'Documentation/mtd')
-rw-r--r-- | Documentation/mtd/index.rst | 12 | ||||
-rw-r--r-- | Documentation/mtd/intel-spi.rst | 90 | ||||
-rw-r--r-- | Documentation/mtd/nand_ecc.rst | 763 | ||||
-rw-r--r-- | Documentation/mtd/spi-nor.rst | 66 |
4 files changed, 0 insertions, 931 deletions
diff --git a/Documentation/mtd/index.rst b/Documentation/mtd/index.rst deleted file mode 100644 index 4fdae418ac97..000000000000 --- a/Documentation/mtd/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -:orphan: - -============================== -Memory Technology Device (MTD) -============================== - -.. toctree:: - :maxdepth: 1 - - intel-spi - nand_ecc - spi-nor diff --git a/Documentation/mtd/intel-spi.rst b/Documentation/mtd/intel-spi.rst deleted file mode 100644 index 0e6d9cd5388d..000000000000 --- a/Documentation/mtd/intel-spi.rst +++ /dev/null @@ -1,90 +0,0 @@ -============================== -Upgrading BIOS using intel-spi -============================== - -Many Intel CPUs like Baytrail and Braswell include SPI serial flash host -controller which is used to hold BIOS and other platform specific data. -Since contents of the SPI serial flash is crucial for machine to function, -it is typically protected by different hardware protection mechanisms to -avoid accidental (or on purpose) overwrite of the content. - -Not all manufacturers protect the SPI serial flash, mainly because it -allows upgrading the BIOS image directly from an OS. - -The intel-spi driver makes it possible to read and write the SPI serial -flash, if certain protection bits are not set and locked. If it finds -any of them set, the whole MTD device is made read-only to prevent -partial overwrites. By default the driver exposes SPI serial flash -contents as read-only but it can be changed from kernel command line, -passing "intel-spi.writeable=1". - -Please keep in mind that overwriting the BIOS image on SPI serial flash -might render the machine unbootable and requires special equipment like -Dediprog to revive. You have been warned! - -Below are the steps how to upgrade MinnowBoard MAX BIOS directly from -Linux. - - 1) Download and extract the latest Minnowboard MAX BIOS SPI image - [1]. At the time writing this the latest image is v92. - - 2) Install mtd-utils package [2]. We need this in order to erase the SPI - serial flash. Distros like Debian and Fedora have this prepackaged with - name "mtd-utils". - - 3) Add "intel-spi.writeable=1" to the kernel command line and reboot - the board (you can also reload the driver passing "writeable=1" as - module parameter to modprobe). - - 4) Once the board is up and running again, find the right MTD partition - (it is named as "BIOS"):: - - # cat /proc/mtd - dev: size erasesize name - mtd0: 00800000 00001000 "BIOS" - - So here it will be /dev/mtd0 but it may vary. - - 5) Make backup of the existing image first:: - - # dd if=/dev/mtd0ro of=bios.bak - 16384+0 records in - 16384+0 records out - 8388608 bytes (8.4 MB) copied, 10.0269 s, 837 kB/s - - 6) Verify the backup: - - # sha1sum /dev/mtd0ro bios.bak - fdbb011920572ca6c991377c4b418a0502668b73 /dev/mtd0ro - fdbb011920572ca6c991377c4b418a0502668b73 bios.bak - - The SHA1 sums must match. Otherwise do not continue any further! - - 7) Erase the SPI serial flash. After this step, do not reboot the - board! Otherwise it will not start anymore:: - - # flash_erase /dev/mtd0 0 0 - Erasing 4 Kibyte @ 7ff000 -- 100 % complete - - 8) Once completed without errors you can write the new BIOS image: - - # dd if=MNW2MAX1.X64.0092.R01.1605221712.bin of=/dev/mtd0 - - 9) Verify that the new content of the SPI serial flash matches the new - BIOS image:: - - # sha1sum /dev/mtd0ro MNW2MAX1.X64.0092.R01.1605221712.bin - 9b4df9e4be2057fceec3a5529ec3d950836c87a2 /dev/mtd0ro - 9b4df9e4be2057fceec3a5529ec3d950836c87a2 MNW2MAX1.X64.0092.R01.1605221712.bin - - The SHA1 sums should match. - - 10) Now you can reboot your board and observe the new BIOS starting up - properly. - -References ----------- - -[1] https://firmware.intel.com/sites/default/files/MinnowBoard%2EMAX_%2EX64%2E92%2ER01%2Ezip - -[2] http://www.linux-mtd.infradead.org/ diff --git a/Documentation/mtd/nand_ecc.rst b/Documentation/mtd/nand_ecc.rst deleted file mode 100644 index e8d3c53a5056..000000000000 --- a/Documentation/mtd/nand_ecc.rst +++ /dev/null @@ -1,763 +0,0 @@ -========================== -NAND Error-correction Code -========================== - -Introduction -============ - -Having looked at the linux mtd/nand driver and more specific at nand_ecc.c -I felt there was room for optimisation. I bashed the code for a few hours -performing tricks like table lookup removing superfluous code etc. -After that the speed was increased by 35-40%. -Still I was not too happy as I felt there was additional room for improvement. - -Bad! I was hooked. -I decided to annotate my steps in this file. Perhaps it is useful to someone -or someone learns something from it. - - -The problem -=========== - -NAND flash (at least SLC one) typically has sectors of 256 bytes. -However NAND flash is not extremely reliable so some error detection -(and sometimes correction) is needed. - -This is done by means of a Hamming code. I'll try to explain it in -laymans terms (and apologies to all the pro's in the field in case I do -not use the right terminology, my coding theory class was almost 30 -years ago, and I must admit it was not one of my favourites). - -As I said before the ecc calculation is performed on sectors of 256 -bytes. This is done by calculating several parity bits over the rows and -columns. The parity used is even parity which means that the parity bit = 1 -if the data over which the parity is calculated is 1 and the parity bit = 0 -if the data over which the parity is calculated is 0. So the total -number of bits over the data over which the parity is calculated + the -parity bit is even. (see wikipedia if you can't follow this). -Parity is often calculated by means of an exclusive or operation, -sometimes also referred to as xor. In C the operator for xor is ^ - -Back to ecc. -Let's give a small figure: - -========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ==== -byte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 -byte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 -byte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 -byte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 -byte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 -... -byte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 -byte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15 - cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 - cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 - cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 -========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ==== - -This figure represents a sector of 256 bytes. -cp is my abbreviation for column parity, rp for row parity. - -Let's start to explain column parity. - -- cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. - - so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. - -Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7. - -- cp2 is the parity over bit0, bit1, bit4 and bit5 -- cp3 is the parity over bit2, bit3, bit6 and bit7. -- cp4 is the parity over bit0, bit1, bit2 and bit3. -- cp5 is the parity over bit4, bit5, bit6 and bit7. - -Note that each of cp0 .. cp5 is exactly one bit. - -Row parity actually works almost the same. - -- rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) -- rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) -- rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... - (so handle two bytes, then skip 2 bytes). -- rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) -- for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. - - so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) -- and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. - -The story now becomes quite boring. I guess you get the idea. - -- rp6 covers 8 bytes then skips 8 etc -- rp7 skips 8 bytes then covers 8 etc -- rp8 covers 16 bytes then skips 16 etc -- rp9 skips 16 bytes then covers 16 etc -- rp10 covers 32 bytes then skips 32 etc -- rp11 skips 32 bytes then covers 32 etc -- rp12 covers 64 bytes then skips 64 etc -- rp13 skips 64 bytes then covers 64 etc -- rp14 covers 128 bytes then skips 128 -- rp15 skips 128 bytes then covers 128 - -In the end the parity bits are grouped together in three bytes as -follows: - -===== ===== ===== ===== ===== ===== ===== ===== ===== -ECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 -===== ===== ===== ===== ===== ===== ===== ===== ===== -ECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 -ECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 -ECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 -===== ===== ===== ===== ===== ===== ===== ===== ===== - -I detected after writing this that ST application note AN1823 -(http://www.st.com/stonline/) gives a much -nicer picture.(but they use line parity as term where I use row parity) -Oh well, I'm graphically challenged, so suffer with me for a moment :-) - -And I could not reuse the ST picture anyway for copyright reasons. - - -Attempt 0 -========= - -Implementing the parity calculation is pretty simple. -In C pseudocode:: - - for (i = 0; i < 256; i++) - { - if (i & 0x01) - rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; - else - rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0; - if (i & 0x02) - rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; - else - rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; - if (i & 0x04) - rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; - else - rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; - if (i & 0x08) - rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; - else - rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; - if (i & 0x10) - rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; - else - rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; - if (i & 0x20) - rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; - else - rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; - if (i & 0x40) - rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; - else - rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; - if (i & 0x80) - rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; - else - rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; - cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; - cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; - cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; - cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 - cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 - cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 - } - - -Analysis 0 -========== - -C does have bitwise operators but not really operators to do the above -efficiently (and most hardware has no such instructions either). -Therefore without implementing this it was clear that the code above was -not going to bring me a Nobel prize :-) - -Fortunately the exclusive or operation is commutative, so we can combine -the values in any order. So instead of calculating all the bits -individually, let us try to rearrange things. -For the column parity this is easy. We can just xor the bytes and in the -end filter out the relevant bits. This is pretty nice as it will bring -all cp calculation out of the for loop. - -Similarly we can first xor the bytes for the various rows. -This leads to: - - -Attempt 1 -========= - -:: - - const char parity[256] = { - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 - }; - - void ecc1(const unsigned char *buf, unsigned char *code) - { - int i; - const unsigned char *bp = buf; - unsigned char cur; - unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; - unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; - unsigned char par; - - par = 0; - rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; - rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; - rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; - rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; - - for (i = 0; i < 256; i++) - { - cur = *bp++; - par ^= cur; - if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; - if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; - if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; - if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; - if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; - if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; - if (i & 0x40) rp13 ^= cur; else rp12 ^= cur; - if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; - } - code[0] = - (parity[rp7] << 7) | - (parity[rp6] << 6) | - (parity[rp5] << 5) | - (parity[rp4] << 4) | - (parity[rp3] << 3) | - (parity[rp2] << 2) | - (parity[rp1] << 1) | - (parity[rp0]); - code[1] = - (parity[rp15] << 7) | - (parity[rp14] << 6) | - (parity[rp13] << 5) | - (parity[rp12] << 4) | - (parity[rp11] << 3) | - (parity[rp10] << 2) | - (parity[rp9] << 1) | - (parity[rp8]); - code[2] = - (parity[par & 0xf0] << 7) | - (parity[par & 0x0f] << 6) | - (parity[par & 0xcc] << 5) | - (parity[par & 0x33] << 4) | - (parity[par & 0xaa] << 3) | - (parity[par & 0x55] << 2); - code[0] = ~code[0]; - code[1] = ~code[1]; - code[2] = ~code[2]; - } - -Still pretty straightforward. The last three invert statements are there to -give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash -all data is 0xff, so the checksum then matches. - -I also introduced the parity lookup. I expected this to be the fastest -way to calculate the parity, but I will investigate alternatives later -on. - - -Analysis 1 -========== - -The code works, but is not terribly efficient. On my system it took -almost 4 times as much time as the linux driver code. But hey, if it was -*that* easy this would have been done long before. -No pain. no gain. - -Fortunately there is plenty of room for improvement. - -In step 1 we moved from bit-wise calculation to byte-wise calculation. -However in C we can also use the unsigned long data type and virtually -every modern microprocessor supports 32 bit operations, so why not try -to write our code in such a way that we process data in 32 bit chunks. - -Of course this means some modification as the row parity is byte by -byte. A quick analysis: -for the column parity we use the par variable. When extending to 32 bits -we can in the end easily calculate rp0 and rp1 from it. -(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 -respectively, from MSB to LSB) -also rp2 and rp3 can be easily retrieved from par as rp3 covers the -first two MSBs and rp2 covers the last two LSBs. - -Note that of course now the loop is executed only 64 times (256/4). -And note that care must taken wrt byte ordering. The way bytes are -ordered in a long is machine dependent, and might affect us. -Anyway, if there is an issue: this code is developed on x86 (to be -precise: a DELL PC with a D920 Intel CPU) - -And of course the performance might depend on alignment, but I expect -that the I/O buffers in the nand driver are aligned properly (and -otherwise that should be fixed to get maximum performance). - -Let's give it a try... - - -Attempt 2 -========= - -:: - - extern const char parity[256]; - - void ecc2(const unsigned char *buf, unsigned char *code) - { - int i; - const unsigned long *bp = (unsigned long *)buf; - unsigned long cur; - unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; - unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; - unsigned long par; - - par = 0; - rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; - rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; - rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; - rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; - - for (i = 0; i < 64; i++) - { - cur = *bp++; - par ^= cur; - if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; - if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; - if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; - if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; - if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; - if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; - } - /* - we need to adapt the code generation for the fact that rp vars are now - long; also the column parity calculation needs to be changed. - we'll bring rp4 to 15 back to single byte entities by shifting and - xoring - */ - rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; - rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; - rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; - rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; - rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; - rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; - rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; - rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; - rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; - rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; - rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; - rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; - rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; - rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; - par ^= (par >> 16); - rp1 = (par >> 8); rp1 &= 0xff; - rp0 = (par & 0xff); - par ^= (par >> 8); par &= 0xff; - - code[0] = - (parity[rp7] << 7) | - (parity[rp6] << 6) | - (parity[rp5] << 5) | - (parity[rp4] << 4) | - (parity[rp3] << 3) | - (parity[rp2] << 2) | - (parity[rp1] << 1) | - (parity[rp0]); - code[1] = - (parity[rp15] << 7) | - (parity[rp14] << 6) | - (parity[rp13] << 5) | - (parity[rp12] << 4) | - (parity[rp11] << 3) | - (parity[rp10] << 2) | - (parity[rp9] << 1) | - (parity[rp8]); - code[2] = - (parity[par & 0xf0] << 7) | - (parity[par & 0x0f] << 6) | - (parity[par & 0xcc] << 5) | - (parity[par & 0x33] << 4) | - (parity[par & 0xaa] << 3) | - (parity[par & 0x55] << 2); - code[0] = ~code[0]; - code[1] = ~code[1]; - code[2] = ~code[2]; - } - -The parity array is not shown any more. Note also that for these -examples I kinda deviated from my regular programming style by allowing -multiple statements on a line, not using { } in then and else blocks -with only a single statement and by using operators like ^= - - -Analysis 2 -========== - -The code (of course) works, and hurray: we are a little bit faster than -the linux driver code (about 15%). But wait, don't cheer too quickly. -There is more to be gained. -If we look at e.g. rp14 and rp15 we see that we either xor our data with -rp14 or with rp15. However we also have par which goes over all data. -This means there is no need to calculate rp14 as it can be calculated from -rp15 through rp14 = par ^ rp15, because par = rp14 ^ rp15; -(or if desired we can avoid calculating rp15 and calculate it from -rp14). That is why some places refer to inverse parity. -Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. -Effectively this means we can eliminate the else clause from the if -statements. Also we can optimise the calculation in the end a little bit -by going from long to byte first. Actually we can even avoid the table -lookups - -Attempt 3 -========= - -Odd replaced:: - - if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; - if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; - if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; - if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; - if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; - if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; - -with:: - - if (i & 0x01) rp5 ^= cur; - if (i & 0x02) rp7 ^= cur; - if (i & 0x04) rp9 ^= cur; - if (i & 0x08) rp11 ^= cur; - if (i & 0x10) rp13 ^= cur; - if (i & 0x20) rp15 ^= cur; - -and outside the loop added:: - - rp4 = par ^ rp5; - rp6 = par ^ rp7; - rp8 = par ^ rp9; - rp10 = par ^ rp11; - rp12 = par ^ rp13; - rp14 = par ^ rp15; - -And after that the code takes about 30% more time, although the number of -statements is reduced. This is also reflected in the assembly code. - - -Analysis 3 -========== - -Very weird. Guess it has to do with caching or instruction parallellism -or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting -observation was that this one is only 30% slower (according to time) -executing the code as my 3Ghz D920 processor. - -Well, it was expected not to be easy so maybe instead move to a -different track: let's move back to the code from attempt2 and do some -loop unrolling. This will eliminate a few if statements. I'll try -different amounts of unrolling to see what works best. - - -Attempt 4 -========= - -Unrolled the loop 1, 2, 3 and 4 times. -For 4 the code starts with:: - - for (i = 0; i < 4; i++) - { - cur = *bp++; - par ^= cur; - rp4 ^= cur; - rp6 ^= cur; - rp8 ^= cur; - rp10 ^= cur; - if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; - if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; - cur = *bp++; - par ^= cur; - rp5 ^= cur; - rp6 ^= cur; - ... - - -Analysis 4 -========== - -Unrolling once gains about 15% - -Unrolling twice keeps the gain at about 15% - -Unrolling three times gives a gain of 30% compared to attempt 2. - -Unrolling four times gives a marginal improvement compared to unrolling -three times. - -I decided to proceed with a four time unrolled loop anyway. It was my gut -feeling that in the next steps I would obtain additional gain from it. - -The next step was triggered by the fact that par contains the xor of all -bytes and rp4 and rp5 each contain the xor of half of the bytes. -So in effect par = rp4 ^ rp5. But as xor is commutative we can also say -that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can -eliminate rp5 (or rp4, but I already foresaw another optimisation). -The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15. - - -Attempt 5 -========= - -Effectively so all odd digit rp assignments in the loop were removed. -This included the else clause of the if statements. -Of course after the loop we need to correct things by adding code like:: - - rp5 = par ^ rp4; - -Also the initial assignments (rp5 = 0; etc) could be removed. -Along the line I also removed the initialisation of rp0/1/2/3. - - -Analysis 5 -========== - -Measurements showed this was a good move. The run-time roughly halved -compared with attempt 4 with 4 times unrolled, and we only require 1/3rd -of the processor time compared to the current code in the linux kernel. - -However, still I thought there was more. I didn't like all the if -statements. Why not keep a running parity and only keep the last if -statement. Time for yet another version! - - -Attempt 6 -========= - -THe code within the for loop was changed to:: - - for (i = 0; i < 4; i++) - { - cur = *bp++; tmppar = cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; - - cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; - - cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; - cur = *bp++; tmppar ^= cur; rp8 ^= cur; - - cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; - - par ^= tmppar; - if ((i & 0x1) == 0) rp12 ^= tmppar; - if ((i & 0x2) == 0) rp14 ^= tmppar; - } - -As you can see tmppar is used to accumulate the parity within a for -iteration. In the last 3 statements is added to par and, if needed, -to rp12 and rp14. - -While making the changes I also found that I could exploit that tmppar -contains the running parity for this iteration. So instead of having: -rp4 ^= cur; rp6 ^= cur; -I removed the rp6 ^= cur; statement and did rp6 ^= tmppar; on next -statement. A similar change was done for rp8 and rp10 - - -Analysis 6 -========== - -Measuring this code again showed big gain. When executing the original -linux code 1 million times, this took about 1 second on my system. -(using time to measure the performance). After this iteration I was back -to 0.075 sec. Actually I had to decide to start measuring over 10 -million iterations in order not to lose too much accuracy. This one -definitely seemed to be the jackpot! - -There is a little bit more room for improvement though. There are three -places with statements:: - - rp4 ^= cur; rp6 ^= cur; - -It seems more efficient to also maintain a variable rp4_6 in the while -loop; This eliminates 3 statements per loop. Of course after the loop we -need to correct by adding:: - - rp4 ^= rp4_6; - rp6 ^= rp4_6 - -Furthermore there are 4 sequential assignments to rp8. This can be -encoded slightly more efficiently by saving tmppar before those 4 lines -and later do rp8 = rp8 ^ tmppar ^ notrp8; -(where notrp8 is the value of rp8 before those 4 lines). -Again a use of the commutative property of xor. -Time for a new test! - - -Attempt 7 -========= - -The new code now looks like:: - - for (i = 0; i < 4; i++) - { - cur = *bp++; tmppar = cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; - - cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; - - notrp8 = tmppar; - cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; - rp8 = rp8 ^ tmppar ^ notrp8; - - cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; - - par ^= tmppar; - if ((i & 0x1) == 0) rp12 ^= tmppar; - if ((i & 0x2) == 0) rp14 ^= tmppar; - } - rp4 ^= rp4_6; - rp6 ^= rp4_6; - - -Not a big change, but every penny counts :-) - - -Analysis 7 -========== - -Actually this made things worse. Not very much, but I don't want to move -into the wrong direction. Maybe something to investigate later. Could -have to do with caching again. - -Guess that is what there is to win within the loop. Maybe unrolling one -more time will help. I'll keep the optimisations from 7 for now. - - -Attempt 8 -========= - -Unrolled the loop one more time. - - -Analysis 8 -========== - -This makes things worse. Let's stick with attempt 6 and continue from there. -Although it seems that the code within the loop cannot be optimised -further there is still room to optimize the generation of the ecc codes. -We can simply calculate the total parity. If this is 0 then rp4 = rp5 -etc. If the parity is 1, then rp4 = !rp5; - -But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits -in the result byte and then do something like:: - - code[0] |= (code[0] << 1); - -Lets test this. - - -Attempt 9 -========= - -Changed the code but again this slightly degrades performance. Tried all -kind of other things, like having dedicated parity arrays to avoid the -shift after parity[rp7] << 7; No gain. -Change the lookup using the parity array by using shift operators (e.g. -replace parity[rp7] << 7 with:: - - rp7 ^= (rp7 << 4); - rp7 ^= (rp7 << 2); - rp7 ^= (rp7 << 1); - rp7 &= 0x80; - -No gain. - -The only marginal change was inverting the parity bits, so we can remove -the last three invert statements. - -Ah well, pity this does not deliver more. Then again 10 million -iterations using the linux driver code takes between 13 and 13.5 -seconds, whereas my code now takes about 0.73 seconds for those 10 -million iterations. So basically I've improved the performance by a -factor 18 on my system. Not that bad. Of course on different hardware -you will get different results. No warranties! - -But of course there is no such thing as a free lunch. The codesize almost -tripled (from 562 bytes to 1434 bytes). Then again, it is not that much. - - -Correcting errors -================= - -For correcting errors I again used the ST application note as a starter, -but I also peeked at the existing code. - -The algorithm itself is pretty straightforward. Just xor the given and -the calculated ecc. If all bytes are 0 there is no problem. If 11 bits -are 1 we have one correctable bit error. If there is 1 bit 1, we have an -error in the given ecc code. - -It proved to be fastest to do some table lookups. Performance gain -introduced by this is about a factor 2 on my system when a repair had to -be done, and 1% or so if no repair had to be done. - -Code size increased from 330 bytes to 686 bytes for this function. -(gcc 4.2, -O3) - - -Conclusion -========== - -The gain when calculating the ecc is tremendous. Om my development hardware -a speedup of a factor of 18 for ecc calculation was achieved. On a test on an -embedded system with a MIPS core a factor 7 was obtained. - -On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor -5 (big endian mode, gcc 4.1.2, -O3) - -For correction not much gain could be obtained (as bitflips are rare). Then -again there are also much less cycles spent there. - -It seems there is not much more gain possible in this, at least when -programmed in C. Of course it might be possible to squeeze something more -out of it with an assembler program, but due to pipeline behaviour etc -this is very tricky (at least for intel hw). - -Author: Frans Meulenbroeks - -Copyright (C) 2008 Koninklijke Philips Electronics NV. diff --git a/Documentation/mtd/spi-nor.rst b/Documentation/mtd/spi-nor.rst deleted file mode 100644 index f5333e3bf486..000000000000 --- a/Documentation/mtd/spi-nor.rst +++ /dev/null @@ -1,66 +0,0 @@ -================= -SPI NOR framework -================= - -Part I - Why do we need this framework? ---------------------------------------- - -SPI bus controllers (drivers/spi/) only deal with streams of bytes; the bus -controller operates agnostic of the specific device attached. However, some -controllers (such as Freescale's QuadSPI controller) cannot easily handle -arbitrary streams of bytes, but rather are designed specifically for SPI NOR. - -In particular, Freescale's QuadSPI controller must know the NOR commands to -find the right LUT sequence. Unfortunately, the SPI subsystem has no notion of -opcodes, addresses, or data payloads; a SPI controller simply knows to send or -receive bytes (Tx and Rx). Therefore, we must define a new layering scheme under -which the controller driver is aware of the opcodes, addressing, and other -details of the SPI NOR protocol. - -Part II - How does the framework work? --------------------------------------- - -This framework just adds a new layer between the MTD and the SPI bus driver. -With this new layer, the SPI NOR controller driver does not depend on the -m25p80 code anymore. - -Before this framework, the layer is like:: - - MTD - ------------------------ - m25p80 - ------------------------ - SPI bus driver - ------------------------ - SPI NOR chip - - After this framework, the layer is like: - MTD - ------------------------ - SPI NOR framework - ------------------------ - m25p80 - ------------------------ - SPI bus driver - ------------------------ - SPI NOR chip - - With the SPI NOR controller driver (Freescale QuadSPI), it looks like: - MTD - ------------------------ - SPI NOR framework - ------------------------ - fsl-quadSPI - ------------------------ - SPI NOR chip - -Part III - How can drivers use the framework? ---------------------------------------------- - -The main API is spi_nor_scan(). Before you call the hook, a driver should -initialize the necessary fields for spi_nor{}. Please see -drivers/mtd/spi-nor/spi-nor.c for detail. Please also refer to fsl-quadspi.c -when you want to write a new driver for a SPI NOR controller. -Another API is spi_nor_restore(), this is used to restore the status of SPI -flash chip such as addressing mode. Call it whenever detach the driver from -device or reboot the system. |