SubBytes() transformation on the state array

The ShiftRows() function

In ShiftRows(), the bytes in the last three rows of the state array are cyclically shifted by different numbers of bytes. The first row, r = 0, is not shifted. The transformation is:

\[ s'_{r,c} = s_{r,\,(c + \mathrm{shift}(r,Nb)) \bmod Nb} \qquad 0 < r < 4,\ 0 \le c < Nb \quad (Nb = 4) \]

where the shift value shift(r, Nb) depends on the row number r as follows:

shift(1,4) = 1, shift(2,4) = 2, shift(3,4) = 3

This moves bytes to lower positions in the row (smaller values of c), while the lowest bytes wrap around to the top of the row (larger values of c).

[Figure: illustration of the ShiftRows() function]
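The shift rule above can be sketched in a few lines of C. This is a minimal illustration, not the implementation used later in the GPU listing: the name shift_rows and the row-first indexing state[r][c] (chosen to match the formula) are assumptions made for this example, whereas the GPU listing indexes the state as [column][row].

```c
#include <stdint.h>

#define NB 4  /* number of columns in the state (Nb in the text) */

/* Cyclically shift row r of the state left by shift(r, Nb) = r bytes:
   s'[r][c] = s[r][(c + r) mod NB]. Row 0 is left untouched. */
void shift_rows(uint8_t state[4][NB])
{
    for (int r = 1; r < 4; ++r) {
        uint8_t row[NB];
        for (int c = 0; c < NB; ++c)
            row[c] = state[r][(c + r) % NB];
        for (int c = 0; c < NB; ++c)
            state[r][c] = row[c];
    }
}
```

For example, a row holding 4, 5, 6, 7 in row 1 becomes 5, 6, 7, 4.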

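The MixColumns() function

MixColumns() operates on the state array column by column, treating each column as a four-term polynomial. The columns are considered as polynomials over GF(2^8) and multiplied modulo x^4 + 1 with a fixed polynomial a(x):

\[ a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\} \]

This can be written as a matrix multiplication. With

\[ s'(x) = a(x) \otimes s(x) \]

we have

\[
\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix}
=
\begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix}
\begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}
\qquad 0 \le c < Nb = 4
\]

The four bytes in each column are therefore replaced as follows:

\[
\begin{aligned}
s'_{0,c} &= (\{02\} \bullet s_{0,c}) \oplus (\{03\} \bullet s_{1,c}) \oplus s_{2,c} \oplus s_{3,c} \\
s'_{1,c} &= s_{0,c} \oplus (\{02\} \bullet s_{1,c}) \oplus (\{03\} \bullet s_{2,c}) \oplus s_{3,c} \\
s'_{2,c} &= s_{0,c} \oplus s_{1,c} \oplus (\{02\} \bullet s_{2,c}) \oplus (\{03\} \bullet s_{3,c}) \\
s'_{3,c} &= (\{03\} \bullet s_{0,c}) \oplus s_{1,c} \oplus s_{2,c} \oplus (\{02\} \bullet s_{3,c})
\end{aligned}
\]

[Figure: illustration of the MixColumns() function]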
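The four column equations can be checked with a small C sketch. It assumes a helper xtime() for multiplication by {02} (the same idea as the xtime() in the GPU listing) and a hypothetical function mix_column() that transforms a single column; multiplication by {03} is computed as xtime(a) XOR a.

```c
#include <stdint.h>

/* Multiply by {02} in GF(2^8), reducing modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0x00));
}

/* Apply the MixColumns matrix to one column s[0..3] in place. */
void mix_column(uint8_t s[4])
{
    uint8_t a0 = s[0], a1 = s[1], a2 = s[2], a3 = s[3];

    /* {03} . a is computed as xtime(a) ^ a */
    s[0] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);
    s[1] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);
    s[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));
    s[3] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));
}
```

Applied to the column db 13 53 45 (hexadecimal), this sketch produces 8e 4d a1 bc.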

The AddRoundKey() function

In AddRoundKey(), a round key is added to the state array with a bitwise XOR. Each round key consists of Nb words produced by the key expansion routine. These Nb words are added to the columns of the state array:

\[ [s'_{0,c}, s'_{1,c}, s'_{2,c}, s'_{3,c}] = [s_{0,c}, s_{1,c}, s_{2,c}, s_{3,c}] \oplus [w_{round \cdot Nb + c}] \qquad 0 \le c < Nb = 4 \]

Here [w_i] are the words of the key schedule and round is the round index, with 0 ≤ round ≤ Nr. The initial round key addition takes place with round = 0, before the rounds of the cipher are executed. Inside the rounds of the cipher, AddRoundKey() is applied with 1 ≤ round ≤ Nr.

The figure below illustrates this, where l = round * Nb:

Figure 23: Illustration of the AddRoundKey() function
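As a minimal sketch of the formula above, the following C function XORs one round key into the state. The name add_round_key, the row-first indexing state[r][c], and the layout of the key schedule as a flat byte array w (four bytes per word) are assumptions made for this example.

```c
#include <stdint.h>

#define NB 4  /* number of columns in the state (Nb in the text) */

/* XOR round key number `round` into the state.
   Column c of the state is combined with word w[round*NB + c];
   each word occupies four consecutive bytes of w. */
void add_round_key(uint8_t state[4][NB], const uint8_t *w, unsigned round)
{
    for (unsigned c = 0; c < NB; ++c) {
        const uint8_t *word = w + 4 * (round * NB + c);
        for (unsigned r = 0; r < 4; ++r)
            state[r][c] ^= word[r];
    }
}
```

Because the operation is a plain XOR, calling the function twice with the same round key restores the original state.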

Key expansion algorithm (Key Expansion)

To generate the key schedule used for encryption, the algorithm takes a cipher key K and runs a key expansion routine. The routine needs an initial set of Nb words, and each of the Nr rounds needs a further Nb words of key data, so it generates Nb*(Nr+1) words in total. The result is a linear array of 4-byte words, denoted [w_i], with 0 ≤ i < Nb(Nr+1).

SubWord() is a function that takes a 4-byte input word and applies the S-box to each of the four bytes to produce an output word. RotWord() takes an input word [a0, a1, a2, a3], performs a cyclic permutation, and returns [a1, a2, a3, a0]. The elements of the round constant array Rcon[i] contain the values [x^(i-1), {00}, {00}, {00}], where x^(i-1) is a power of x (x is represented as {02} in GF(2^8), and i starts at 1).

The first Nk words of the expanded key are filled with the cipher key. Every following word w[i] is the XOR of the previous word w[i-1] and the word Nk positions earlier, w[i-Nk]. For words at positions that are a multiple of Nk, a transformation is applied to w[i-1] before the XOR, followed by an XOR with the round constant Rcon[i/Nk]. This transformation consists of a cyclic shift of the bytes of the word (RotWord()) followed by a table lookup applied to all four bytes of the word (SubWord()). The expansion routine for 256-bit keys differs slightly from the one for 128-bit and 192-bit keys: if Nk = 8 and i-4 is a multiple of Nk, SubWord() is applied to w[i-1] before the XOR.
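The expansion rules can be sketched in C as follows. This is an illustrative version under stated assumptions, not the KeyExpansion() of the listing below: the names gf_mul, sbox_value and key_expansion are hypothetical, Nb is fixed to 4, the key schedule is stored as a flat byte array, and the S-box entries are computed on the fly (multiplicative inverse followed by the affine map) instead of being read from a table, which is slow but keeps the example short.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return p;
}

/* S-box entry for x: multiplicative inverse (0 maps to 0), then the affine map. */
static uint8_t sbox_value(uint8_t x)
{
    uint8_t inv = 0;
    for (int y = 1; y < 256 && x != 0; ++y)
        if (gf_mul(x, (uint8_t)y) == 1) { inv = (uint8_t)y; break; }
    uint8_t s = inv;
    for (int k = 1; k <= 4; ++k)            /* XOR with inv rotated left by 1..4 */
        s ^= (uint8_t)((inv << k) | (inv >> (8 - k)));
    return (uint8_t)(s ^ 0x63);
}

/* Expand an Nk-word cipher key into 4*(Nr+1) words (Nb = 4).
   w is a byte array with four bytes per word; nk is 4, 6 or 8 and
   nr is 10, 12 or 14 respectively. */
void key_expansion(const uint8_t *key, uint8_t *w, unsigned nk, unsigned nr)
{
    uint8_t rcon = 0x01;                     /* Rcon[1] = x^0 */

    for (unsigned i = 0; i < 4 * nk; ++i)    /* first Nk words = cipher key */
        w[i] = key[i];

    for (unsigned i = nk; i < 4 * (nr + 1); ++i) {
        uint8_t t[4];
        for (unsigned j = 0; j < 4; ++j)
            t[j] = w[4 * (i - 1) + j];

        if (i % nk == 0) {
            /* RotWord, then SubWord, then XOR with Rcon[i/Nk] */
            uint8_t t0 = t[0];
            t[0] = (uint8_t)(sbox_value(t[1]) ^ rcon);
            t[1] = sbox_value(t[2]);
            t[2] = sbox_value(t[3]);
            t[3] = sbox_value(t0);
            rcon = gf_mul(rcon, 0x02);       /* next power of x */
        } else if (nk > 6 && i % nk == 4) {
            for (unsigned j = 0; j < 4; ++j) /* extra SubWord for 256-bit keys */
                t[j] = sbox_value(t[j]);
        }

        for (unsigned j = 0; j < 4; ++j)
            w[4 * i + j] = (uint8_t)(w[4 * (i - nk) + j] ^ t[j]);
    }
}
```

With the 128-bit key 2b7e1516 28aed2a6 abf71588 09cf4f3c (the key used in the driver file below), calling key_expansion(key, w, 4, 10) gives a0fafe17 as the first derived word w[4].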

3.2.2 Decryption

Decryption follows the same structure as encryption, except that each function is replaced by the inverse of the corresponding encryption function [4].

The InvShiftRows() function

InvShiftRows() is the inverse of ShiftRows(). The bytes in the last three rows of the state array are cyclically shifted by different numbers of bytes: each of these rows is shifted by Nb - shift(r, Nb) bytes, where shift(r, Nb) depends on the row number. The first row is not shifted.

The function works as follows:

\[ s'_{r,\,(c + \mathrm{shift}(r,Nb)) \bmod Nb} = s_{r,c} \qquad 0 < r < 4,\ 0 \le c < Nb \quad (Nb = 4) \]

The figure below illustrates this:

Figure 24: Illustration of the InvShiftRows() function
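The inverse shift can be sketched in C by writing each byte to the position given on the left-hand side of the formula. As before, the name inv_shift_rows and the row-first indexing state[r][c] are assumptions for this example only.

```c
#include <stdint.h>

#define NB 4  /* number of columns in the state (Nb in the text) */

/* Undo ShiftRows: s'[r][(c + r) mod NB] = s[r][c], i.e. row r is rotated
   right by r bytes. Row 0 is left untouched. */
void inv_shift_rows(uint8_t state[4][NB])
{
    for (int r = 1; r < 4; ++r) {
        uint8_t row[NB];
        for (int c = 0; c < NB; ++c)
            row[(c + r) % NB] = state[r][c];
        for (int c = 0; c < NB; ++c)
            state[r][c] = row[c];
    }
}
```

A row 1 holding 5, 6, 7, 4 (the result of shifting 4, 5, 6, 7 left by one) is restored to 4, 5, 6, 7.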

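The InvSubBytes() function

InvSubBytes() is the inverse of SubBytes(). It applies the inverse S-box to each byte of the state; the inverse S-box is obtained by applying the inverse of the affine transformation and then taking the multiplicative inverse in GF(2^8).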
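The two steps can be sketched in C as below. This is an illustration under assumptions: the names gf_mul, rotl8 and inv_sbox_value are hypothetical, the inverse affine map is written in its usual rotation form (rotate left by 1, 3 and 6, then XOR with {05}), and the multiplicative inverse is found by brute-force search, which a practical implementation would replace with a 256-entry lookup table.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return p;
}

/* Rotate an 8-bit value left by k bits (0 < k < 8). */
static uint8_t rotl8(uint8_t x, int k)
{
    return (uint8_t)((x << k) | (x >> (8 - k)));
}

/* Inverse S-box entry for s: inverse affine map, then multiplicative inverse. */
uint8_t inv_sbox_value(uint8_t s)
{
    uint8_t b = (uint8_t)(rotl8(s, 1) ^ rotl8(s, 3) ^ rotl8(s, 6) ^ 0x05);
    if (b == 0)
        return 0;                       /* 0 has no inverse and maps to 0 */
    for (int y = 1; y < 256; ++y)
        if (gf_mul(b, (uint8_t)y) == 1)
            return (uint8_t)y;
    return 0;                           /* not reached for b != 0 */
}
```

Since the S-box maps {00} to {63} and {01} to {7c} (see the d_sbox table below), inv_sbox_value(0x63) returns 0x00 and inv_sbox_value(0x7c) returns 0x01.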

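The InvMixColumns() function

InvMixColumns() is the inverse of MixColumns(). It operates on the state array column by column, treating each column as a four-term polynomial. The columns are considered as polynomials over GF(2^8) and multiplied modulo x^4 + 1 with a fixed polynomial a^{-1}(x):

\[ a^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\} \]

This can be written as a matrix multiplication. With

\[ s'(x) = a^{-1}(x) \otimes s(x) \]

we have

\[
\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix}
=
\begin{bmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{bmatrix}
\begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}
\qquad 0 \le c < Nb
\]

The four bytes in each column are replaced as follows:

\[
\begin{aligned}
s'_{0,c} &= (\{0e\} \bullet s_{0,c}) \oplus (\{0b\} \bullet s_{1,c}) \oplus (\{0d\} \bullet s_{2,c}) \oplus (\{09\} \bullet s_{3,c}) \\
s'_{1,c} &= (\{09\} \bullet s_{0,c}) \oplus (\{0e\} \bullet s_{1,c}) \oplus (\{0b\} \bullet s_{2,c}) \oplus (\{0d\} \bullet s_{3,c}) \\
s'_{2,c} &= (\{0d\} \bullet s_{0,c}) \oplus (\{09\} \bullet s_{1,c}) \oplus (\{0e\} \bullet s_{2,c}) \oplus (\{0b\} \bullet s_{3,c}) \\
s'_{3,c} &= (\{0b\} \bullet s_{0,c}) \oplus (\{0d\} \bullet s_{1,c}) \oplus (\{09\} \bullet s_{2,c}) \oplus (\{0e\} \bullet s_{3,c})
\end{aligned}
\]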
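A short C sketch of these four equations follows. The names gf_mul and inv_mix_column are hypothetical; gf_mul is a generic GF(2^8) multiplication used here for clarity rather than speed.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return p;
}

/* Apply the InvMixColumns matrix to one column s[0..3] in place. */
void inv_mix_column(uint8_t s[4])
{
    uint8_t a0 = s[0], a1 = s[1], a2 = s[2], a3 = s[3];

    s[0] = (uint8_t)(gf_mul(0x0e, a0) ^ gf_mul(0x0b, a1) ^ gf_mul(0x0d, a2) ^ gf_mul(0x09, a3));
    s[1] = (uint8_t)(gf_mul(0x09, a0) ^ gf_mul(0x0e, a1) ^ gf_mul(0x0b, a2) ^ gf_mul(0x0d, a3));
    s[2] = (uint8_t)(gf_mul(0x0d, a0) ^ gf_mul(0x09, a1) ^ gf_mul(0x0e, a2) ^ gf_mul(0x0b, a3));
    s[3] = (uint8_t)(gf_mul(0x0b, a0) ^ gf_mul(0x0d, a1) ^ gf_mul(0x09, a2) ^ gf_mul(0x0e, a3));
}
```

Applied to the column 8e 4d a1 bc (hexadecimal), the sketch returns db 13 53 45.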

Inverse of the AddRoundKey() function

AddRoundKey() is its own inverse, because it consists only of a bitwise XOR.

3.3 Parallel AES encryption program using the GPU

Parallel program using the GPU:

Algorithm file

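#include "aes.h"
#include <stdint.h> // uint8_t, uintmax_t (added in case aes.h does not provide it)
#include <stdio.h>  // printf (added in case aes.h does not provide it)

// state - array holding the intermediate results during encryption.
typedef uint8_t state_t[4][4];

// The array that stores the round keys.
//__device__ static const uint8_t* RoundKey;

// Returns the global index of the calling thread in a 1D grid.
__device__ uintmax_t get_global_index(void) {
    return blockIdx.x * blockDim.x + threadIdx.x;
}

// prints string as hex
__device__ static void phex(uint8_t* str) {
    unsigned char i;
    for (i = 0; i < 16; ++i)
        printf("%.2x", str[i]);
    printf("\n");
}

// Debug helper: prints the 4x4 state of the calling thread together with a message.
// NOTE: the function header is missing in the source; this signature is
// reconstructed from the commented-out calls print_state(state, "...") in Cipher().
__device__ static void print_state(state_t* state, const char* message) {
    uintmax_t idx = get_global_index();
    printf("[thread %llu] state %s\n%.2x %.2x %.2x %.2x\n%.2x %.2x %.2x %.2x\n%.2x %.2x %.2x %.2x\n%.2x %.2x %.2x %.2x\n",
        (unsigned long long)idx, message,
        (*state)[0][0], (*state)[0][1], (*state)[0][2], (*state)[0][3],
        (*state)[1][0], (*state)[1][1], (*state)[1][2], (*state)[1][3],
        (*state)[2][0], (*state)[2][1], (*state)[2][2], (*state)[2][3],
        (*state)[3][0], (*state)[3][1], (*state)[3][2], (*state)[3][3]);
}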

//__device__ static void printKey() {

// printf("RoundKey:\n");

// unsigned char i, j;

// for (j = 0; j < ROUNDS + 1; ++j) {

// for (i = 0; i < KEYLENGTH; ++i)

// printf("%.2x", RoundKey[(j*KEYLENGTH) + i]);

// printf("\n");

// }

//}

// Lookup-tables

__device__ __constant__ uint8_t d_sbox[256] = {

//0 1 2 3 4 5 6 7 8 9 A B C D E F

0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5, 0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76,

0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0, 0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0,

0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc, 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15,

0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a, 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75,

0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0, 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84,

0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b, 0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf,

0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85, 0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8,

0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5, 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2,

0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17, 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73,

0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88, 0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb,

0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c, 0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79,

0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9, 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08,

0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6, 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a,

0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e, 0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e,

0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94, 0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf,

0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68, 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16 };

// XOR the round key on state.
__device__ void AddRoundKey(state_t* state, uint8_t* roundKey, uint8_t round) {
    //uintmax_t idx = get_global_index();
    //printf("[Thread %lld] roundKey: %.2x%.2x%.2x%.2x%.2x%.2x%.2x%.2x%.2x%.2x%.2x%.2x%.2x%.2x%.2x%.2x\n", idx,
    //    roundKey[round*BLOCKSIZE + 0], roundKey[round*BLOCKSIZE + 1], roundKey[round*BLOCKSIZE + 2], roundKey[round*BLOCKSIZE + 3],
    //    roundKey[round*BLOCKSIZE + 4], roundKey[round*BLOCKSIZE + 5], roundKey[round*BLOCKSIZE + 6], roundKey[round*BLOCKSIZE + 7],
    //    roundKey[round*BLOCKSIZE + 8], roundKey[round*BLOCKSIZE + 9], roundKey[round*BLOCKSIZE + 10], roundKey[round*BLOCKSIZE + 11],
    //    roundKey[round*BLOCKSIZE + 12], roundKey[round*BLOCKSIZE + 13], roundKey[round*BLOCKSIZE + 14], roundKey[round*BLOCKSIZE + 15]);
    uint8_t i, j;
    // (*state)[i][j] holds row j of column i; each round key is LANESIZE * 4 bytes long.
    for (i = 0; i < 4; ++i) {
        for (j = 0; j < 4; ++j) {
            //printf("[Thread %lld] (*state)[%d][%d] before: %.2x\n", idx, i, j, (*state)[i][j]);
            (*state)[i][j] ^= roundKey[round * LANESIZE * 4 + i * LANESIZE + j];
            //printf("[Thread %lld] (*state)[%d][%d] after: %.2x\n", idx, i, j, (*state)[i][j]);
        }
    }
}

// The SubBytes function substitutes the values in the
// state matrix with values in an S-box.
__device__ void SubBytes(state_t* state, uint8_t* s_sbox) {
    uint8_t i, j;
    for (i = 0; i < 4; ++i) {
        for (j = 0; j < 4; ++j) {
            (*state)[j][i] = s_sbox[(*state)[j][i]];
        }
    }
}

// The ShiftRows() function shifts the rows in the state to the left.
// Each row is shifted with different offset.
// Offset = Row number. So the first row is not shifted.
// Note: the state is indexed as (*state)[column][row].
__device__ void ShiftRows(state_t* state)
{
    uint8_t temp;

    // Rotate row 1 one column to the left
    temp = (*state)[0][1];
    (*state)[0][1] = (*state)[1][1];
    (*state)[1][1] = (*state)[2][1];
    (*state)[2][1] = (*state)[3][1];
    (*state)[3][1] = temp;

    // Rotate row 2 two columns to the left
    temp = (*state)[0][2];
    (*state)[0][2] = (*state)[2][2];
    (*state)[2][2] = temp;
    temp = (*state)[1][2];
    (*state)[1][2] = (*state)[3][2];
    (*state)[3][2] = temp;

    // Rotate row 3 three columns to the left
    temp = (*state)[0][3];
    (*state)[0][3] = (*state)[3][3];
    (*state)[3][3] = (*state)[2][3];
    (*state)[2][3] = (*state)[1][3];
    (*state)[1][3] = temp;
}

__device__ uint8_t xtime(uint8_t x) {

return ((x << 1) ^ (((x >> 7) & 1) * 0x1b)); }

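// MixColumns function mixes the columns of the state matrix
__device__ void MixColumns(state_t* state)
{
    uint8_t i;
    uint8_t Tmp, Tm, t;
    for (i = 0; i < 4; ++i) {
        t = (*state)[i][0];   // saved because row 0 is overwritten before row 3 is computed
        Tmp = (*state)[i][0] ^ (*state)[i][1] ^ (*state)[i][2] ^ (*state)[i][3];
        Tm = (*state)[i][0] ^ (*state)[i][1]; Tm = xtime(Tm); (*state)[i][0] ^= Tm ^ Tmp;
        Tm = (*state)[i][1] ^ (*state)[i][2]; Tm = xtime(Tm); (*state)[i][1] ^= Tm ^ Tmp;
        Tm = (*state)[i][2] ^ (*state)[i][3]; Tm = xtime(Tm); (*state)[i][2] ^= Tm ^ Tmp;
        // The update of row 3 is missing in the source listing; without it the
        // last byte of every column would stay unmixed.
        Tm = (*state)[i][3] ^ t; Tm = xtime(Tm); (*state)[i][3] ^= Tm ^ Tmp;
    }
}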

// Cipher is the main function that encrypts the PlainText.
__device__ void Cipher(state_t* state, uint8_t* roundKey, uint8_t* s_sbox) {
    uint8_t round = 0;

    // Add the First round key to the state before starting the rounds.
    AddRoundKey(state, roundKey, round);
    //print_state(state, "after first round key added");

    // There will be ROUNDS rounds.
    // The first ROUNDS-1 rounds are identical.
    // These ROUNDS-1 rounds are executed in the loop below.
    for (round = 1; round < ROUNDS; ++round)
    {
        SubBytes(state, s_sbox);
        ShiftRows(state);
        MixColumns(state);
        AddRoundKey(state, roundKey, round);
        //print_state(state, "after round key added");
    }

    // The last round is given below.
    // The MixColumns function is not here in the last round.
    SubBytes(state, s_sbox);
    ShiftRows(state);
    AddRoundKey(state, roundKey, ROUNDS);
    //print_state(state, "after last round key added");
}

__device__ void AES128_ECB_encrypt(uint8_t* ciphertext_block, uint8_t* roundKey, uint8_t* s_sbox) {
    state_t* state = (state_t*)ciphertext_block;
    //print_state(state, "after init");

    // The next function call encrypts the PlainText with the Key using AES algorithm.
    Cipher(state, roundKey, s_sbox);
}

__global__ void cuda_encrypt_block(uint8_t* d_ciphertext, uint8_t* d_plaintext, uint8_t* d_roundKey, uintmax_t plaintext_blocks) {
    uintmax_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ uint8_t s_roundKey[BLOCKSIZE * (ROUNDS + 1)];
    //__shared__ uint8_t s_ciphertext[BLOCKSIZE * THREADS_PER_BLOCK];
    __shared__ uint8_t s_sbox[256];

    // byte offset of this thread's plaintext block in global memory
    uintmax_t offset = idx*BLOCKSIZE;
    // byte offset derived from the thread's position inside its thread block
    uintmax_t block_offset = (idx % THREADS_PER_BLOCK) * BLOCKSIZE;

    // if there are enough THREADS_PER_BLOCK, the round key allocation to shared memory is performed by (ROUNDS + 1) threads in parallel
    if (THREADS_PER_BLOCK >= (ROUNDS + 1) && (idx % THREADS_PER_BLOCK) < (ROUNDS + 1)) {
        memcpy(s_roundKey + block_offset, d_roundKey + block_offset, BLOCKSIZE);
    }
    // if not, this is done only by the first thread in a block
    else if ((idx % THREADS_PER_BLOCK) == 0) {
        memcpy(s_roundKey, d_roundKey, BLOCKSIZE*(ROUNDS + 1));
    }

    // first thread in a block copies sbox from constant to shared memory
    if ((idx % THREADS_PER_BLOCK) == 0) {
        memcpy(s_sbox, d_sbox, sizeof(uint8_t) * 256);
    }

    __syncthreads();

    if (idx < plaintext_blocks) {
        memcpy(d_ciphertext + offset, d_plaintext + offset, BLOCKSIZE);

        // each plaintext block is encrypted by an individual thread, in place in global memory.
        // The source passes d_ciphertext + block_offset here; since the shared s_ciphertext
        // buffer is disabled, the global offset is used so that threads of different
        // thread blocks do not all work on the first THREADS_PER_BLOCK blocks.
        AES128_ECB_encrypt(d_ciphertext + offset, s_roundKey, s_sbox);

        //memcpy(d_ciphertext + offset, s_ciphertext + block_offset, sizeof(uint8_t)*BLOCKSIZE);
    }
}

Driver file

#define DEBUG 0
#include "aes.h"
#include <stdio.h>
// The following headers are added here in case aes.h does not already include them:
#include <stdlib.h>  // malloc, free, exit
#include <string.h>  // strcmp
#include <assert.h>  // assert
#include <windows.h> // LARGE_INTEGER, QueryPerformanceCounter, QueryPerformanceFrequency

// Parameter order matches the definition below: input file first, then output file.
static double encrypt_file(char* infile, char* outfile, uint8_t* key);
static void __host__ phex(uint8_t* str);

uint8_t key[16] = { (uint8_t)0x2b, (uint8_t)0x7e, (uint8_t)0x15, (uint8_t)0x16, (uint8_t)0x28, (uint8_t)0xae, (uint8_t)0xd2, (uint8_t)0xa6,
    (uint8_t)0xab, (uint8_t)0xf7, (uint8_t)0x15, (uint8_t)0x88, (uint8_t)0x09, (uint8_t)0xcf, (uint8_t)0x4f, (uint8_t)0x3c };

// The array that stores the round keys.
uint8_t h_roundKey[176];

boolean silent = 0;

void print_usage() {
    printf("Usage: aes_parallel.exe <input file> <output file> [--silent]\n");
    return;
}

// NOTE: the header of main() is missing in the source; it is reconstructed
// here from the use of argc and argv in the body.
int main(int argc, char* argv[]) {
    if (argc < 3 || argc > 4) {
        print_usage();
        return 1;
    }

    double cpu_time_used;
    if (argc == 4)
        if (!strcmp(argv[3], "--silent"))
            silent = 1;

    cpu_time_used = encrypt_file(argv[1], argv[2], key);
    printf("Execution time: %6.9f seconds\n", cpu_time_used);

    printf("Press enter to continue...\n");
    getchar();

    return 0;
}

double encrypt_file(char* infile, char* outfile, uint8_t* key) {
    FILE *fp_in;
    FILE *fp_out;

#if defined(DEBUG) && DEBUG
    uintmax_t i; // wide enough to count plaintext blocks (uint8_t in the source would wrap at 256)
#endif

    // A file that cannot be opened is fatal even in silent mode; only the message is suppressed.
    fp_in = fopen(infile, "rb");
    if (fp_in == NULL) {
        if (!silent)
            fprintf(stderr, "Can't open input file %s!\n", infile);
        exit(1);
    }

    fp_out = fopen(outfile, "wb+");
    if (fp_out == NULL) {
        if (!silent)
            fprintf(stderr, "Can't open output file %s!\n", outfile);
        exit(1);
    }

    KeyExpansion(key);

#if defined(DEBUG) && DEBUG
    printf("Round Keys:\n");
    for (i = 0; i < ROUNDS + 1; i++) {
        phex(h_roundKey + (i * BLOCKSIZE));
    }
#endif

    // determine size of file, read file into h_plaintext and determine number of plaintext blocks
    fseek(fp_in, 0, SEEK_END);
    uintmax_t plaintext_size = ftell(fp_in);
    rewind(fp_in);

    uint8_t* h_plaintext = (uint8_t*)malloc(plaintext_size);
    uintmax_t bytes_read = fread(h_plaintext, sizeof(uint8_t), plaintext_size, fp_in);
    assert(bytes_read == plaintext_size);

    // round up so that a partial last block still gets its own block
    uintmax_t plaintext_blocks = (bytes_read + BLOCKSIZE - 1) / BLOCKSIZE;
    uint8_t* h_ciphertext = (uint8_t*)malloc(plaintext_blocks*BLOCKSIZE);

    if (!silent) {
        printf("File size: %llu bytes\n", plaintext_size);
        printf("Number of plaintext blocks: %llu (blocksize: %d bytes)\n", plaintext_blocks, BLOCKSIZE);
    }

#if defined(DEBUG) && DEBUG
    printf("Plaintext:\n");
    for (i = 0; i < plaintext_blocks; i++) {
        phex(h_plaintext + (i * BLOCKSIZE));
    }
#endif

cudaError_t cudaStatus;

uintmax_t threads_per_block = THREADS_PER_BLOCK;

uintmax_t number_of_blocks = (plaintext_blocks + threads_per_block - 1) / threads_per_block;

uintmax_t shared_memory_size = BLOCKSIZE * THREADS_PER_BLOCK + BLOCKSIZE * (ROUNDS + 1) + 256;

if (!silent) {

printf("Launching kernel with configuration:\n");

printf("Threads per block: %lld\n", threads_per_block);

printf("Number of blocks: %lld\n", number_of_blocks);

printf("Shared memory size (per block): %lld\n", shared_memory_size);

}

// measure time
double cpu_time_used;
LARGE_INTEGER frequency;
LARGE_INTEGER start, end;

QueryPerformanceFrequency(&frequency);

// start timer

QueryPerformanceCounter(&start);

    // copy h_plaintext and h_roundKey into global device memory
    // All device pointers are declared here and set to NULL so that the cleanup
    // code after the Error label never frees an uninitialized pointer.
    uint8_t* d_plaintext = NULL;
    uint8_t* d_roundKey = NULL;
    uint8_t* d_ciphertext = NULL;

    // Errors abort the run even in silent mode; only the message is suppressed.
    cudaStatus = cudaMalloc((void**)&d_plaintext, sizeof(uint8_t) * (plaintext_blocks * BLOCKSIZE));
    if (cudaStatus != cudaSuccess) {
        if (!silent)
            fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // make sure the last block is padded with zero bytes by initializing the full array with zero bytes
    cudaStatus = cudaMemset(d_plaintext, 0, sizeof(uint8_t) * (plaintext_blocks * BLOCKSIZE));
    if (cudaStatus != cudaSuccess) {
        if (!silent)
            fprintf(stderr, "cudaMemset failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(d_plaintext, h_plaintext, sizeof(uint8_t)*plaintext_size, cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        if (!silent)
            fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // the return value is now stored, so the check below tests this call
    cudaStatus = cudaMalloc((void**)&d_roundKey, sizeof(uint8_t)*BLOCKSIZE*(ROUNDS+1));
    if (cudaStatus != cudaSuccess) {
        if (!silent)
            fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(d_roundKey, h_roundKey, sizeof(uint8_t)*BLOCKSIZE*(ROUNDS + 1), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        if (!silent)
            fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // allocate space for the ciphertext on the device
    cudaStatus = cudaMalloc((void**)&d_ciphertext, sizeof(uint8_t) * (plaintext_blocks * BLOCKSIZE));
    if (cudaStatus != cudaSuccess) {
        if (!silent)
            fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // reset last error
    cudaGetLastError();

    cuda_encrypt_block<<<number_of_blocks, threads_per_block/*,shared_memory_size*/>>>(d_ciphertext, d_plaintext, d_roundKey, plaintext_blocks);

    // Errors abort the run even in silent mode; only the message is suppressed.
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        if (!silent)
            fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // wait for the kernel to finish and pick up errors raised during execution
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        if (!silent)
            fprintf(stderr, "cudaDeviceSynchronize failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // Copy ciphertext array from device memory to host memory.
    cudaStatus = cudaMemcpy(h_ciphertext, d_ciphertext, sizeof(uint8_t) * (plaintext_blocks * BLOCKSIZE), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        if (!silent)
            fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // stop timer
    QueryPerformanceCounter(&end);
    cpu_time_used = ((double)(end.QuadPart - start.QuadPart)) / ((double)frequency.QuadPart);

#if defined(DEBUG) && DEBUG
    printf("Ciphertext after kernel returned:\n");
    for (i = 0; i < plaintext_blocks; i++) {
        phex(h_ciphertext + (i * BLOCKSIZE));
    }
#endif

    // write ciphertext to output file
    fwrite(h_ciphertext, sizeof(uint8_t), BLOCKSIZE * plaintext_blocks, fp_out);

    if (!silent)
        printf("\nEncryption of %llu plaintext blocks successful!\n", plaintext_blocks);

    // release host and device resources on the success path as well
    free(h_plaintext);
    free(h_ciphertext);
    cudaFree(d_plaintext);
    cudaFree(d_ciphertext);
    cudaFree(d_roundKey);
    fclose(fp_in);
    fclose(fp_out);
    return cpu_time_used;

Error:
    free(h_plaintext);
    free(h_ciphertext);
    // h_roundKey is a static array, so it must not be passed to free()
    cudaFree(d_plaintext);
    cudaFree(d_ciphertext);
    cudaFree(d_roundKey);
    fclose(fp_in);
    fclose(fp_out);
    exit(1);
}

// This function produces (ROUNDS+1) round keys. The round keys are used in each round to encrypt the state.
void KeyExpansion(uint8_t* key) {
    uint32_t i, j, k;
    uint8_t tempa[4]; // Used for the column/row operations

    // The first round key is the key itself.
    for (i = 0; i < KEYWORDS; ++i) {
        h_roundKey[(i * 4) + 0] = key[(i * 4) + 0];
        h_roundKey[(i * 4) + 1] = key[(i * 4) + 1];
        h_roundKey[(i * 4) + 2] = key[(i * 4) + 2]; // this byte is missing in the source listing
        h_roundKey[(i * 4) + 3] = key[(i * 4) + 3];
    }

    // All other round keys are found from the previous round keys.
    for (; (i < (LANESIZE * (ROUNDS + 1))); ++i)
    {
        for (j = 0; j < 4; ++j) {
            tempa[j] = h_roundKey[(i - 1) * 4 + j];
        }
        if (i % KEYWORDS == 0) {
            // This function rotates the 4 bytes in a word to the left once.
            // [a0,a1,a2,a3] becomes [a1,a2,a3,a0]
            // Function RotWord()
            {
                k = tempa[0];
                tempa[0] = tempa[1];
                tempa[1] = tempa[2];
                tempa[2] = tempa[3];
                tempa[3] = k;
            }

            // SubWord() is a function that takes a four-byte input word and
            // applies the S-box to each of the four bytes to produce an output word.
            // Function Subword()
            {
                tempa[0] = sbox[tempa[0]];
                tempa[1] = sbox[tempa[1]];
                tempa[2] = sbox[tempa[2]];
                tempa[3] = sbox[tempa[3]];
            }

            tempa[0] = tempa[0] ^ Rcon[i / KEYWORDS];
        }
        else if (KEYWORDS > 6 && i % KEYWORDS == 4)
        {
            // Function Subword() (extra step for 256-bit keys only)
            {
                tempa[0] = sbox[tempa[0]];
                tempa[1] = sbox[tempa[1]];
                tempa[2] = sbox[tempa[2]];
                tempa[3] = sbox[tempa[3]];
            }
        }
        h_roundKey[i * 4 + 0] = h_roundKey[(i - KEYWORDS) * 4 + 0] ^ tempa[0];
        h_roundKey[i * 4 + 1] = h_roundKey[(i - KEYWORDS) * 4 + 1] ^ tempa[1];
        h_roundKey[i * 4 + 2] = h_roundKey[(i - KEYWORDS) * 4 + 2] ^ tempa[2];
        h_roundKey[i * 4 + 3] = h_roundKey[(i - KEYWORDS) * 4 + 3] ^ tempa[3];
    }
}

// prints string as hex
static void phex(uint8_t* str) {
    unsigned char i;
    for (i = 0; i < 16; ++i)
        printf("%.2x", str[i]);
    printf("\n");
}

Program using the CPU:

Algorithm file

#include <stdint.h>
#include <stdio.h>  // printf (added in case aes.h does not provide it)
#include <string.h> // CBC mode, for memset
#include "aes.h"

typedef uint8_t state_t[4][4];
static state_t* state;

// The array that stores the round keys.
static const uint8_t* RoundKey;

// prints string as hex
static void phex(uint8_t* str) {
    unsigned char i;
    for (i = 0; i < 16; ++i)
        printf("%.2x", str[i]);
    printf("\n");
}

static void print_state() {
    uint8_t i, j;
    printf("state:\n");
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++)
            printf("%.2x", (*state)[i][j]);
        printf("\n");
    }
}

static void printKey() {
    printf("RoundKey:\n");
    unsigned char i, j;
    for (j = 0; j < ROUNDS + 1; ++j) {
        for (i = 0; i < KEYLENGTH; ++i)
            printf("%.2x", RoundKey[(j*KEYLENGTH)+i]);
        printf("\n");
    }
}

// The round constant word array, Rcon[i], contains the values given by
// x to the power (i-1), being powers of x (x is denoted as {02}) in the field GF(2^8).
// Note that i starts at 1, not 0.
static const uint8_t Rcon[255] = {

0x8d, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36, 0x6c, 0xd8, 0xab, 0x4d, 0x9a,

0x2f, 0x5e, 0xbc, 0x63, 0xc6, 0x97, 0x35, 0x6a, 0xd4, 0xb3, 0x7d, 0xfa, 0xef, 0xc5, 0x91, 0x39,

0x72, 0xe4, 0xd3, 0xbd, 0x61, 0xc2, 0x9f, 0x25, 0x4a, 0x94, 0x33, 0x66, 0xcc, 0x83, 0x1d, 0x3a,

0x74, 0xe8, 0xcb, 0x8d, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36, 0x6c, 0xd8,

0xab, 0x4d, 0x9a, 0x2f, 0x5e, 0xbc, 0x63, 0xc6, 0x97, 0x35, 0x6a, 0xd4, 0xb3, 0x7d, 0xfa, 0xef,

0xc5, 0x91, 0x39, 0x72, 0xe4, 0xd3, 0xbd, 0x61, 0xc2, 0x9f, 0x25, 0x4a, 0x94, 0x33, 0x66, 0xcc,

0x83, 0x1d, 0x3a, 0x74, 0xe8, 0xcb, 0x8d, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b,

0x36, 0x6c, 0xd8, 0xab, 0x4d, 0x9a, 0x2f, 0x5e, 0xbc, 0x63, 0xc6, 0x97, 0x35, 0x6a, 0xd4, 0xb3,

0x7d, 0xfa, 0xef, 0xc5, 0x91, 0x39, 0x72, 0xe4, 0xd3, 0xbd, 0x61, 0xc2, 0x9f, 0x25, 0x4a, 0x94,

0x33, 0x66, 0xcc, 0x83, 0x1d, 0x3a, 0x74, 0xe8, 0xcb, 0x8d, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20,

0x40, 0x80, 0x1b, 0x36, 0x6c, 0xd8, 0xab, 0x4d, 0x9a, 0x2f, 0x5e, 0xbc, 0x63, 0xc6, 0x97, 0x35,

0x6a, 0xd4, 0xb3, 0x7d, 0xfa, 0xef, 0xc5, 0x91, 0x39, 0x72, 0xe4, 0xd3, 0xbd, 0x61, 0xc2, 0x9f,
