BCH Decode
Parhi Department of Electrical and Computer Engineering University of Minnesota, Minneapolis, MN 55455 USA
ABSTRACT

Long BCH codes achieve an additional coding gain of around 0.6 dB compared to Reed-Solomon codes with similar code rate used for long-haul optical communication systems. For the parallel decoder architecture considered here, a novel group matching scheme is proposed that reduces the overall hardware complexity of both the Chien search and syndrome generator units by 46% for the BCH(2047, 1926, 23) code, as opposed to only 22% when the iterative matching algorithm is applied directly. The proposed scheme exploits substructure sharing within a finite field multiplier (FFM) and among groups of FFMs.

1. INTRODUCTION

Forward-error correction codes used in long-haul optical communication systems should provide significant coding gains (an error floor may occur only at a much lower bit error rate (BER), such as $10^{-15}$) with high code rate and moderate complexity. In International Telecommunication Union (ITU-T) G.975, the (255, 239) Reed-Solomon (RS) code has been standardized to resist burst errors for optical fiber submarine cable systems [1]. With only 7% overhead, this RS code not only provides approximately 5.5 dB coding gain at a BER of $10^{-12}$ for random error correction, but also corrects bursts of up to 64 bits [2].

BCH and RS codes form the core of the most powerful known algebraic codes and are widely used [3]. In our simulation using hard-decision errors-only decoding under an AWGN channel, an additional coding gain of approximately 0.6 dB is observed for binary BCH codes compared to RS codes with similar code rate and codeword length. Hence, the BCH code and its decoder architecture are of great interest.

To increase the decoding throughput, a parallel decoder is derived by developing parallel architectures for the various building blocks. Among the three major building blocks in the syndrome-based BCH decoder, i.e., the syndrome generator unit, the key equation solver and the Chien search, the parallel Chien search block is the most area-consuming unit according to [2].
It occupies more than 65% of the logic core for both 10- and 40-Gb/s forward error correction devices. Therefore, developing an area-efficient parallel Chien search circuit for high-throughput BCH decoders is of great interest and is considered in this paper. The area-efficient scheme is then applied to the syndrome generator unit as well.

This paper is organized as follows. In Section 2, the decoding performance of long BCH codes is presented and compared to RS codes. In Section 3, we briefly review the implementation of the three major building blocks of the BCH decoder. Section 4 is devoted to the area-efficient schemes that significantly reduce the complexity of the parallel Chien search architecture as well as the syndrome generator units. Section 5 provides the conclusions.

This work was supported by the Army Research Office under grant number DA/DAAD19-01-1-0705.

2. HIGH RATE BCH CODES VERSUS REED-SOLOMON CODES

For the purpose of performance comparison, the BCH and RS codes with similar code rate listed in Table 1 are considered. Here the code parameters (n, k, d) represent codeword length, information length and minimum distance, respectively.

Table 1. Considered high-rate long BCH and RS codes
| code parameters (n, k, d) | code rate | SNR (dB) at the BER of $10^{-5}$ |
|---|---|---|
| BCH(2047, 1926, 23) | 0.941 | 6.4 |
| BCH(8191, 7684, 79) | 0.938 | 6.0 |
| RS(255, 239, 17) | 0.937 | 7.0 |
| RS(1023, 959, 65) | 0.937 | 6.6 |
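The code-rate column of Table 1 follows directly from the (n, k, d) parameters. The short sketch below recomputes it, together with the error-correcting capability t = (d - 1) / 2 and the redundancy overhead (n - k) / k. The variable names are illustrative only and do not come from the paper.

```python
# Recompute the derived quantities for the four codes of Table 1.
# Tuple layout (family, n, k, d) is an illustrative choice.
codes = [
    ("BCH", 2047, 1926, 23),
    ("BCH", 8191, 7684, 79),
    ("RS", 255, 239, 17),
    ("RS", 1023, 959, 65),
]

summary = []
for family, n, k, d in codes:
    rate = round(k / n, 3)                        # code rate k/n
    t = (d - 1) // 2                              # correctable errors (bits for BCH, symbols for RS)
    overhead = round(100 * (n - k) / k, 1)        # redundancy relative to payload, in percent
    summary.append((family, n, k, rate, t, overhead))

for row in summary:
    print(row)
```

The overhead computed for RS(255, 239, 17) is about 6.7%, consistent with the "7% overhead" quoted in the introduction, and t = 39 for BCH(8191, 7684, 79) matches the value used in Section 4.2.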
Under an AWGN channel using BPSK and hard-decision errors-only decoding, the performance curves for the considered high-rate codes are depicted in Fig. 1.
[Fig. 1: BER versus Eb/N0 (dB) for RS(255, 239, 17), RS(1023, 959, 65), BCH(2047, 1926, 23) and BCH(8191, 7684, 79).]
V - 73
ICASSP 2004
From Fig. 1, it is easily observed that BCH codes achieve slightly better performance than RS codes with similar code rate and codeword length. An additional coding gain of 0.6 dB at the BER of $10^{-5}$ is achieved for BCH(2047, 1926, 23) compared to the RS(255, 239, 17) code. Likewise, a 0.6 dB coding gain is seen for BCH(8191, 7684, 79) compared to RS(1023, 959, 65). Because of this better decoding performance, only the BCH decoder design is considered in later sections.

3. BCH DECODER ARCHITECTURE

In this section, a parallel BCH decoder is presented. Syndrome-based BCH decoding consists of three major steps [3], as depicted in Fig. 2, where R is the hard decision of the information received from the noisy channel and D is the decoded codeword. S and $\Lambda$ represent the syndromes of the received polynomial and the error locator polynomial, respectively.
[Fig. 2: block diagram of the syndrome-based BCH decoder: R, syndrome generator, S, key equation solver, $\Lambda$, Chien search, error correction, D, with a FIFO buffer.]

Fig. 3. Syndrome generator unit: (a) conventional architecture; (b) parallel architecture with parallel factor of p
3.1. Syndrome Generator

For t-error-correcting BCH codes, the 2t syndromes of the received polynomial can be evaluated as follows:

$$S_j = R(\alpha^j) = \sum_{i=0}^{n-1} R_i (\alpha^j)^i \qquad (1)$$

for $1 \le j \le 2t$. If 2t conventional syndrome generator units as shown in Fig. 3(a) are used independently at the same time, n clock cycles are needed to compute all 2t syndromes. However, if each syndrome generator unit in Fig. 3(a) is replaced by the parallel syndrome generator unit with parallel factor p depicted in Fig. 3(b), which processes p bits per clock cycle, only $\lceil n/p \rceil$ clock cycles are sufficient. It is worth noting that for binary BCH codes, even-indexed syndromes are the squares of earlier-indexed syndromes, i.e., $S_{2j} = S_j^2$. Based on this constraint, only t parallel syndrome generator units are actually required to compute the odd-indexed syndromes, followed by a much simpler field square circuit to generate the even-indexed syndromes.

3.2. Key Equation Solver

Either Peterson's or the Berlekamp-Massey (BM) algorithm [3] can be employed to solve the key equations for $\Lambda(x)$. The inversion-free BM algorithm and its efficient implementations can easily be found in the literature [2] [4] and are not considered in this paper.

3.3. Chien Search

Once $\Lambda(x)$ is found, the decoder searches for error locations by checking whether $\Lambda(\alpha^i) = 0$ for $0 \le i \le (n-1)$, which is normally achieved by Chien search. A conventional serial Chien search architecture is shown in Fig. 4, and

$$\Lambda(\alpha^i) = \sum_{j=0}^{t} \Lambda_j \alpha^{ij} = \sum_{j=1}^{t} \Lambda_j \alpha^{ij} + 1 \qquad (2)$$

where $0 \le i \le (n-1)$. All the multiplexers select $\Lambda(x)$ in the first clock cycle, then select the registered data afterwards.

[Fig. 4: conventional serial Chien search architecture with multiplexers MUX 1 to MUX t and output $\Lambda(\alpha^i)$.]
Since all n possible locations have to be evaluated for $\Lambda(x)$, it takes n clock cycles to complete the Chien search process. To speed up this process, a parallel Chien search architecture that evaluates several locations per clock cycle is essential. Two possible architectures with parallel factor p are depicted in Fig. 5(a) [2] and Fig. 5(b) [4], where Fig. 5(a) is simply a direct unfolded version of Fig. 4 with an unfolding factor of p. Both designs in Fig. 5 reduce the number of clock cycles spent searching for error locations from n to $\lceil n/p \rceil$, and they also share similar hardware complexity. Denoting the parallel factor as p, both designs have exactly the same $(p \times t)$ constant finite field multipliers (FFMs), p t-input m-bit finite field adders (FFAs), p m-bit registers and p m-bit multiplexers. However, the critical path of Fig. 5(a) is $(T_{mux} + p \times T_m + T_a)$, while it is only $(T_{mux} + T_m + T_a)$ for Fig. 5(b), where $T_{mux}$, $T_m$ and $T_a$ stand for the critical path of the multiplexer, the FFM and the t-input m-bit FFA, respectively. Once the parallel factor p is greater than 1, a much faster clock speed can therefore be achieved for the design in Fig. 5(b) than for that in Fig. 5(a). For example, assuming $T_m$ is dominant, the critical path of Fig. 5(b) is approximately p times shorter.
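The syndrome evaluation in (1), the relation $S_{2j} = S_j^2$, and the p-parallel evaluation of (2) can be exercised on a toy code. The sketch below is illustrative only and rests on several assumptions that are not taken from the paper: it uses GF(2^4) with primitive polynomial x^4 + x + 1 instead of GF(2^11), it builds $\Lambda(x)$ directly from known error positions (so the key equation solver is skipped), and it assumes the common convention $\Lambda(x) = \prod (1 + \alpha^{e} x)$, under which a root at $\alpha^i$ marks an error at position (n - i) mod n. The function names are hypothetical. The register update by $\alpha^{jp}$ with feed-forward constants $\alpha^{jk}$ mimics the structure described for Fig. 5(b).

```python
# Toy field GF(2^4), primitive polynomial x^4 + x + 1 (assumption for illustration).
M, PRIM = 4, 0b10011
N = (1 << M) - 1                       # codeword length n = 15
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x >> M:                         # reduce modulo the primitive polynomial
        x ^= PRIM
for i in range(N, 2 * N):
    EXP[i] = EXP[i - N]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def syndromes(r_bits, t):
    """Eq. (1): S_j = sum_i R_i (alpha^j)^i for j = 1..2t."""
    out = []
    for j in range(1, 2 * t + 1):
        s = 0
        for i, bit in enumerate(r_bits):
            if bit:
                s ^= EXP[(i * j) % N]
        out.append(s)
    return out

def chien_parallel(lam, p):
    """Eq. (2) evaluated at p locations per clock cycle; p = 1 is the serial search."""
    regs = list(lam[1:])               # register j starts at Lambda_j
    roots, cycles = [], 0
    for base in range(0, N, p):
        cycles += 1
        for k in range(min(p, N - base)):
            total = lam[0]             # Lambda_0 = 1
            for j, v in enumerate(regs, start=1):
                total ^= gf_mul(v, EXP[(j * k) % N])   # feed-forward constant alpha^(j*k)
            if total == 0:
                roots.append(base + k)
        # loop multiplier alpha^(j*p) advances every register by p locations
        regs = [gf_mul(v, EXP[(j * p) % N]) for j, v in enumerate(regs, start=1)]
    return roots, cycles

t = 2
errors = [3, 10]                                    # hypothetical error positions
r = [1 if i in errors else 0 for i in range(N)]     # all-zero codeword plus two bit errors
S = syndromes(r, t)

# Lambda(x) = (1 + alpha^3 x)(1 + alpha^10 x), built from the known positions
lam = [1, EXP[3] ^ EXP[10], EXP[13]]
serial_roots, serial_cycles = chien_parallel(lam, 1)
par_roots, par_cycles = chien_parallel(lam, 4)
positions = sorted((N - i) % N for i in par_roots)
print(S, par_roots, positions, serial_cycles, par_cycles)
```

With p = 4 the search over n = 15 locations finishes in 4 cycles instead of 15 and finds the same roots as the serial search, which is the $\lceil n/p \rceil$ behaviour described above.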
Fig. 5. Two different p-parallel Chien search architectures: (a) direct unfolded version; (b) equivalent architecture with shorter critical path

4. COMPLEXITY REDUCTION SCHEME

In this section, a complexity reduction scheme that eliminates the redundant computations of the FFMs is discussed in detail. An optimization algorithm developed in [5] can reduce the number of XOR gates for constant FFM operations by up to 40% compared to a straightforward implementation. In this paper, a different algorithm, called the iterative matching algorithm (IMA), is used to reduce the area. The main idea is to use iterative substructure sharing to eliminate the redundant computations.

4.1. Iterative Matching Algorithm

Consider a constant multiplication in $GF(2^m)$, where P is the product of a fixed operand $\alpha^i$, with $1 \le i \le t$ for the design in Fig. 5(a) and $1 \le i \le (t \times p)$ for the design in Fig. 5(b), and a variable field element B:

$$
\begin{aligned}
P &= \alpha^i B = \alpha^i \left(b_0 + b_1 \alpha + \dots + b_{m-1} \alpha^{m-1}\right) \\
  &= b_0 \alpha^i + b_1 \alpha^{i+1} + \dots + b_{m-1} \alpha^{i+m-1} \\
  &= \begin{pmatrix}
       \alpha^{i}_{0} & \alpha^{i+1}_{0} & \cdots & \alpha^{i+m-1}_{0} \\
       \alpha^{i}_{1} & \alpha^{i+1}_{1} & \cdots & \alpha^{i+m-1}_{1} \\
       \vdots & \vdots & \ddots & \vdots \\
       \alpha^{i}_{m-1} & \alpha^{i+1}_{m-1} & \cdots & \alpha^{i+m-1}_{m-1}
     \end{pmatrix}
     \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{m-1} \end{pmatrix} \\
  &= [coeff]_{m \times m} \, [b]_{m \times 1}
\end{aligned}
\qquad (3)
$$

Since all the elements $\alpha^{i+j}_{l}$ in the matrix coeff, where $0 \le l, j \le (m-1)$, are simple binary elements and all the additions are modulo 2 operations, the computational complexity of an FFM can also be defined by the number of XOR gates. While computing the m coefficients of the product P, $p_0, p_1, \dots, p_{m-1}$, all of which are linear combinations of the coefficients of B, there are many redundant modulo 2 additions, which allows the number of operations to be reduced. Different from the algorithm in [5], our iterative matching algorithm, based on [6], consists of the following four basic steps.

1. Determine the number of bit-wise matches (nonzero bits) between all of the rows in the binary matrix coeff;
2. Choose the best match;
3. Eliminate the redundancy from the best match; return the remainders to the two rows that contribute the best match; append an additional row at the bottom of the binary matrix to hold the redundancy;
4. Repeat steps 1-3 for all the rows in the binary matrix, including the appended rows, until no improvement is achieved, i.e., the best match is not greater than 1 bit.

4.2. Implementation Results and Group Matching

By applying the IMA to both Chien search architectures in Fig. 5, the implementation results in terms of the number of XOR gates for the BCH(2047, 1926, 23) code are listed in Table 2.

Table 2. Chien search complexity for BCH(2047, 1926, 23) with parallel factor of 32
| implementation methods | design in Fig. 5(a) | design in Fig. 5(b) | area saving for Fig. 5(b) |
|---|---|---|---|
| straightforward implementation | 6080 | 16624 | 0% |
| IMA within each FFM | 5984 | 12257 | 26% |
| IMA among 4 FFMs | - | 8468 | 49% |
| IMA among 8 FFMs | - | 7598 | 54% |
| IMA among 16 FFMs | - | 7058 | 58% |
| IMA among 32 FFMs | - | 6653 | 60% |
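The area-saving column of Table 2 can be recomputed from the XOR-gate counts of the Fig. 5(b) design, taking its straightforward implementation as the baseline. The snippet below is only an arithmetic check; the dictionary keys repeat the row labels of the table and the variable names are illustrative.

```python
# XOR-gate counts for the design in Fig. 5(b), BCH(2047, 1926, 23), p = 32 (Table 2).
straightforward = 16624
xor_counts = {
    "IMA within each FFM": 12257,
    "IMA among 4 FFMs": 8468,
    "IMA among 8 FFMs": 7598,
    "IMA among 16 FFMs": 7058,
    "IMA among 32 FFMs": 6653,
}

# Saving relative to the straightforward Fig. 5(b) implementation, in percent.
savings = {
    name: round(100 * (straightforward - count) / straightforward)
    for name, count in xor_counts.items()
}

# Fig. 5(b) with 32-FFM group matching versus Fig. 5(a) with IMA within each FFM.
ratio_b_to_a = xor_counts["IMA among 32 FFMs"] / 5984

for name, pct in savings.items():
    print(f"{name}: {pct}%")
print(f"Fig. 5(b) / Fig. 5(a): {ratio_b_to_a:.2f}")
```

The ratio of about 1.11 quantifies how close the two designs end up once group matching is applied to Fig. 5(b).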
In Table 2, IMA is explored not only within individual FFMs but also among g FFMs in the same column (see Fig. 5), where g is the group size and $1 < g \le p$. These FFMs share the same multiplicand $\Lambda_i$, where $1 \le i \le t$. This latter case is called group matching among g FFMs: the g binary coefficient matrices coeff of the g constant FFMs are combined to search for the best match. In other words, the bit-wise matches are searched in a coefficient matrix of size $gm \times m$ instead of $m \times m$. Note that group matching is possible only for the design in Fig. 5(b), which gives it another advantage over the design in Fig. 5(a) in addition to the shorter critical path.

If implemented in a straightforward manner, the design in Fig. 5(a) has much smaller complexity than that in Fig. 5(b), simply because in the latter case the constant FFMs have more powers of $\alpha$ with higher Hamming weight as multiplicands. However, as the group matching factor g is increased, the number of XOR gates is reduced significantly for the design in Fig. 5(b). When g is equal to the parallel factor 32, the area saving compared to the straightforward implementation is 60%, as opposed to merely 26% when the iterative matching algorithm is applied to individual FFMs. Consequently, the complexity of Fig. 5(b) is very close to that of Fig. 5(a), and the former design retains its shorter critical path.

For the longer BCH(8191, 7684, 79) code (see Table 3), the complexity of the design in Fig. 5(a) grows very quickly as its error-correcting capability t is increased to 39, whereas the complexity of the design in Fig. 5(b) increases only slightly. Moreover, the area saving for the design in Fig. 5(b) after employing group matching is more significant than for the BCH(2047, 1926, 23) code. In fact, its number of XOR gates is already smaller than that of the design in Fig. 5(a) even when group matching is carried out among only 4 FFMs. When the group matching factor g is increased to 32, the complexity of the design in Fig. 5(b) is approximately 30% less than that of the design in Fig. 5(a). This implies that for longer BCH codes, a design with both lower complexity and higher speed can be achieved by applying the proposed group matching scheme.

Table 3. Chien search complexity for BCH(8191, 7684, 79) with parallel factor of 32
| implementation methods | design in Fig. 5(a) | design in Fig. 5(b) | area saving for Fig. 5(b) |
|---|---|---|---|
| straightforward implementation | 89664 | 102277 | 0% |
| IMA within each FFM | 57504 | 63997 | 37% |
| IMA among 4 FFMs | - | 51029 | 50% |
| IMA among 8 FFMs | - | 46329 | 55% |
| IMA among 16 FFMs | - | 42534 | 58% |
| IMA among 32 FFMs | - | 39365 | 62% |
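The coefficient matrix of (3), the four IMA steps, and group matching over a stacked $gm \times m$ matrix can be sketched on a toy field. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses GF(2^4) with primitive polynomial x^4 + x + 1, takes g = 4 constants $\alpha^1, \dots, \alpha^4$ applied to the same input B, and models each shared term as an extra column that the two matched rows reference. All function names are hypothetical, and the resulting counts are specific to this toy field.

```python
# Toy field GF(2^4), primitive polynomial x^4 + x + 1 (assumption for illustration).
M, PRIM = 4, 0b10011
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x >> M:
        x ^= PRIM
for i in range(N, 2 * N):
    EXP[i] = EXP[i - N]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def coeff_rows(e):
    """Rows of the binary matrix in (3) for the constant alpha^e, as bitmasks.
    Bit j of row l is bit l of alpha^(e+j), so p_l is the XOR of the selected b_j."""
    return [sum(((EXP[(e + j) % N] >> l) & 1) << j for j in range(M))
            for l in range(M)]

def xor_count(rows):
    """A row with w nonzero entries needs w - 1 two-input XOR gates."""
    return sum(max(bin(r).count("1") - 1, 0) for r in rows)

def ima(rows, n_in):
    """Greedy iterative matching: share the largest common term until it is <= 1 bit."""
    rows, n_out = list(rows), len(rows)
    while True:
        best, pair = 1, None
        for a in range(len(rows)):                      # step 1: count bit-wise matches
            for b in range(a + 1, len(rows)):
                c = bin(rows[a] & rows[b]).count("1")
                if c > best:                            # step 2: keep the best match
                    best, pair = c, (a, b)
        if pair is None:                                # step 4: stop, best match <= 1 bit
            return rows
        a, b = pair
        common = rows[a] & rows[b]
        ref = 1 << (n_in + len(rows) - n_out)           # new column naming the shared term
        rows[a] = (rows[a] ^ common) | ref              # step 3: remainders plus a reference
        rows[b] = (rows[b] ^ common) | ref
        rows.append(common)                             # appended row holds the redundancy

def evaluate(rows, n_in, n_out, b):
    """Evaluate the shared XOR network on input bits b; returns the first n_out outputs."""
    cache = {}
    def signal(k):
        if k < n_in:
            return (b >> k) & 1
        if k not in cache:
            cache[k] = row_value(n_out + k - n_in)
        return cache[k]
    def row_value(r):
        return sum(signal(k) for k in range(rows[r].bit_length()) if (rows[r] >> k) & 1) & 1
    return [row_value(r) for r in range(n_out)]

g = 4
consts = list(range(1, g + 1))                          # exponents of alpha^1 .. alpha^4
stacked = [row for e in consts for row in coeff_rows(e)]   # (g*m) x m matrix
straight = xor_count(stacked)
individual = sum(xor_count(ima(coeff_rows(e), M)) for e in consts)
grouped_rows = ima(stacked, M)
grouped = xor_count(grouped_rows)

def network_is_correct():
    for b in range(1 << M):
        out = evaluate(grouped_rows, M, len(stacked), b)
        for idx, e in enumerate(consts):
            prod = gf_mul(EXP[e], b)
            if [(prod >> l) & 1 for l in range(M)] != out[idx * M:(idx + 1) * M]:
                return False
    return True

print(straight, individual, grouped, network_is_correct())
```

For this toy setting the XOR count drops from 11 (straightforward) to 10 (IMA within each FFM) to 4 (group matching among the 4 FFMs), and the shared network still reproduces all four products. The much larger matrices of the BCH(2047, 1926, 23) and BCH(8191, 7684, 79) decoders behave differently in detail, so these numbers only illustrate the trend reported in Tables 2 and 3.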
4.3. Apply Group Matching to Both Chien Search and Syndrome Generator Units

In a similar manner, the group matching scheme described above can also be applied to the constant FFMs in the feedback loop of the syndrome generator units in Fig. 3(b) to reduce the number of XOR gates. Since one of the multiplicands in the feed-forward FFMs is simply a binary number, no multiplication is performed and hence no complexity reduction is needed for those FFMs. The combined complexity of the Chien search and syndrome generator units is listed in Table 4. In Table 4, the results for group matching are obtained by applying the IMA among 32 FFMs for the Chien search and individual matching for the parallel syndrome generator units. From Table 4 we can observe that the complexity of the parallel syndrome generator units is dominated by the p-input m-bit FFAs rather than by the FFMs in the feedback loop, which explains why the area saving is small for the syndrome part. However, for the BCH(2047, 1926, 23) code, 46% of the area is saved for the combined complexity. This is a significant improvement over directly applying the IMA, which saves only 22% of the XOR gates. Similar results are observed for the BCH(8191, 7684, 79) code.

Table 4. Combined complexity for both Chien search and syndrome units with parallel factor of 32

| code parameters | implementation methods | Chien search | Syndrome | Combined |
|---|---|---|---|---|
| BCH(2047, 1926, 23) | straightforward implementation | 16624 | 4372 | 20996 |
| BCH(2047, 1926, 23) | after individual matching | 12257 | 4181 | 16438 |
| BCH(2047, 1926, 23) | area saving | 26% | 4% | 22% |
| BCH(2047, 1926, 23) | after group matching | 6653 | 4181 | 10834 |
| BCH(2047, 1926, 23) | area saving | 60% | 4% | 46% |
| BCH(8191, 7684, 79) | straightforward implementation | 102277 | 18933 | 121210 |
| BCH(8191, 7684, 79) | after individual matching | 63997 | 17733 | 81730 |
| BCH(8191, 7684, 79) | area saving | 37% | 6% | 33% |
| BCH(8191, 7684, 79) | after group matching | 39365 | 17733 | 57098 |
| BCH(8191, 7684, 79) | area saving | 62% | 6% | 53% |

5. CONCLUSIONS

In this paper, a novel complexity reduction scheme is proposed to reduce the area consumption of high-throughput binary BCH decoders. As a result, the parallel decoder architecture reduces the number of XOR gates by roughly 50% compared to the original design. This is also a significant improvement over the previous results, in which the iterative matching algorithm was applied within individual FFMs. Consequently, an area-efficient design with a very short critical path is obtained. All the techniques presented in this paper can easily be extended to RS codes.

6. REFERENCES

[1] "Forward error correction for submarine systems," Telecommunication Standardization Sector, International Telecommunication Union, G.975, 1996.
[2] L. Song, M.-L. Yu, and M. S. Shaffer, "10- and 40-Gb/s forward error correction devices for optical communications," IEEE J. Solid-State Circuits, vol. 37, pp. 1565-1573, Nov. 2002.
[3] S. B. Wicker, Error Control Systems for Digital Communication and Storage. Upper Saddle River, NJ: Prentice-Hall, 1995.
[4] H. C. Chang, C. B. Shung, and C. Y. Lee, "A Reed-Solomon Product-Code (RS-PC) decoder chip for DVD applications," IEEE J. Solid-State Circuits, vol. 36, pp. 229-238, Feb. 2001.
[5] C. Paar, "Optimized arithmetic for Reed-Solomon encoders," IEEE Proc. of ISIT'97, p. 250, 1997.
[6] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, "Multiple constant multiplications: efficient and versatile framework and algorithms for exploring common subexpression elimination," IEEE Trans. on Computer-Aided Design, vol. 15, no. 2, pp. 151-165, Feb. 1996.