Multi-arch support for pre-built CPU wheels

Currently, the pre-built CPU wheels are compiled for the lowest-common-denominator architecture, so they cannot use any instruction-set extensions the host CPU may support. If we compile multiple versions of the llama.cpp library, each with different accelerations enabled, we can bundle all of them into the pre-built wheels and dynamically choose one at load time based on the host CPU.
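A rough sketch of what the selection step could look like, to make the idea concrete. Everything here is an assumption for illustration: the library file names, the feature sets per variant, and reading flags from `/proc/cpuinfo` (Linux x86 only; other platforms would need a different detection method and fall back to the baseline build in this sketch).

```python
from pathlib import Path

# Hypothetical variants, ordered from most to least demanding.
# Each entry: (library file name, CPU flags that build requires).
VARIANTS = [
    ("libllama_avx512.so", {"avx512f", "avx2", "fma", "f16c"}),
    ("libllama_avx2.so", {"avx2", "fma", "f16c"}),
    ("libllama_avx.so", {"avx"}),
    ("libllama_baseline.so", set()),
]


def read_cpu_flags(cpuinfo_path="/proc/cpuinfo"):
    """Return the CPU feature flags as a set; empty set if they cannot be read."""
    try:
        text = Path(cpuinfo_path).read_text()
    except OSError:
        return set()
    for line in text.splitlines():
        if line.startswith("flags"):
            # Line looks like "flags : fpu vme ... avx2 ..."
            return set(line.split(":", 1)[1].split())
    return set()


def choose_variant(flags, variants=VARIANTS):
    """Pick the first (most accelerated) variant whose requirements are all met."""
    for name, required in variants:
        if required <= flags:
            return name
    raise RuntimeError("no compatible library variant found")
```

The package would then load the chosen file in place of the single bundled library, for example with `ctypes.CDLL(str(lib_dir / choose_variant(read_cpu_flags())))`, where `lib_dir` is a hypothetical directory inside the wheel. Because the baseline entry requires no flags, an unknown or unreadable CPU still gets a working build.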