If you can’t use a big-integer library, and you don’t have a native uint128_t
(or similar), you’ll need to do this manually.
One option is to express a as the sum of two 32-bit quantities, i.e. a = 2^32 * b + c, where b contains the 32 msbs and c contains the 32 lsbs. Squaring is then a set of four cross-multiplications (b*b, b*c, c*b and c*c); each is a 32-bit by 32-bit product, so each result is guaranteed to fit into a 64-bit type. You then do the modulo operation as you recombine the individual terms (carefully taking into account the shifts needed to realign everything).
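As a rough sketch of how that could look in C (assuming a and the modulus n are both uint64_t and n is nonzero): expanding the square gives a^2 = 2^64 * b*b + 2^33 * b*c + c*c, since the two middle terms b*c and c*b are equal. The names sqrmod, addmod and shlmod are my own, not from any library, and the shifts are done by repeated modular doubling, which is simple rather than fast.

```c
#include <stdint.h>

/* (x + y) mod n without overflowing 64 bits.
   Assumes x < n, y < n and n > 0. */
static uint64_t addmod(uint64_t x, uint64_t y, uint64_t n)
{
    return (x >= n - y) ? x - (n - y) : x + y;
}

/* (x * 2^k) mod n by doubling k times. Assumes x < n. */
static uint64_t shlmod(uint64_t x, unsigned k, uint64_t n)
{
    while (k--)
        x = addmod(x, x, n);
    return x;
}

/* (a * a) mod n using only 64-bit arithmetic. Assumes n > 0. */
uint64_t sqrmod(uint64_t a, uint64_t n)
{
    uint64_t b = a >> 32;            /* 32 msbs */
    uint64_t c = a & 0xFFFFFFFFu;    /* 32 lsbs */

    /* Each product is 32 x 32 bits, so it fits in 64 bits. */
    uint64_t bb = (b * b) % n;
    uint64_t bc = (b * c) % n;       /* b*c and c*b are the same term */
    uint64_t cc = (c * c) % n;

    /* a^2 = 2^64 * bb + 2^33 * bc + cc, reduced at every step. */
    uint64_t r = shlmod(bb, 64, n);
    r = addmod(r, shlmod(bc, 33, n), n);
    r = addmod(r, cc, n);
    return r;
}
```

For example, sqrmod(10000000000ULL, 1000000007ULL) should give 4900, because 10^9 is congruent to -7 modulo 10^9 + 7. If the doubling loops turn out to be a bottleneck, they are the part to replace; the split into b and c stays the same.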