How to get a random number in ASM on the Commander X16 ?

BruceMcF

Active Member
May 19, 2019
150
49
28
Good point. Though if the output were shifted in parallel with testing the range, that would save having a lookup table. That said, having a "top bit" table would be useful in general, even if it's a lot of memory to use for this. Ie, getting the mask could be just a lookup or two.

I'm running BigCrush again with reversed bits, that will help test if the low bits are good. I'll be disappointed if they aren't.

Also I fixed a couple typos in the code above.
I guess it varies with use case ... if a typical repeated call in a loop is going to have the same n (eg, dice game with 6 sized dice, random position in a square array), remembering the last n and last mask will avoid the overhead after the first call, making the mask generation overhead trivial, so it is mostly a convenience to the caller to infer the mask rather than demand that it be passed.

But if a typical repeated call in a loop is going to have different value, might allow the caller to hand the mask or #0, and only compute the mask when the caller doesn't hand it.
 

JoshuaScholar

New Member
Nov 6, 2019
24
10
3
Ok, I've verified that the 65c02 code has the same output as the C code. I've edited bug fixes into the simplified (ie second) version above and verified that it assembles on ca65.
However it is true that I don't use ca65, I use a built in 65c02 assembler in c++ that I wrote into a version of the emulator. I could paste the actual tested code, but it's syntax is weird and needs a little explanation. But by comparing against the other version you could tell what it means.
One note, instructions that end "abs" are 16 bit address loads. Ones that end "ab" are smarter and if the address is under 256 they're zero page loads.
same with "abx" vs. "absx" where the x means indexed by x.

Update, it now has the utility routines, random number up to n for bytes and random number up to n for words.
It's all tested. And if you want to try this yourself you can get it here.

I'm using this repository to develop a compiler, you can ignore the huge amount of ifdefed out code in compile.cpp.

But under the ifdeffed out code is the rand number generator.

Please look at the comments, I have detailed comments at the head of each assembly language routine.

The basic information is that the seed is 3 bytes in rngseed1tmp on zero page, and 3 bytes in rngseed2tmp on zero page.

You call seedrandom to process that seed and initialize the random number generator.

And after that, you can call rndbyte for each byte you need. x and y are untouched and the random byte is in a.

the utilities are rndbytevalue and rndwordvalue
rndbytevalue takes a range in "a" and returns a value in "a". It does not preserve y, but does preserve x.
rndwordvalue takes a low part of the limit in x and the high part in a. It returns the low random value in x and the high in a. y is not preserved.

There isn't room to include the utility routines, a table for them or the test routines for the code, I ran past the 10000 character limit. See the repository.

C++:
{

    const int MLEN1 = 71;
    const int MLAG1 = 65;
    const int MLEN2 = 55;
    const int MLAG2 = 24;

    Label rngstate1, rngstate2, rngcache, rngcarry1, rngcarry2, rngi1, rngs1, rngi2, rngs2, rngseed1, rngseed2;
    Label conditionshiftseedgen, shft3left, shft3right, rot3right, eor3shftright, eor3shftleft, seedshiftreggen;
    Label seedrandom, rndbyte, rndbytevalue, rnd_mask_table, starttest, dotest, dorndbytevalue, dorndwordvalue,  rndwordvalue;
    Label t1, t2, t3, t4, t5, t6, t7, a1, a2, a3, a4, a5, b1, b2, b3, b4, mff, m7f, m3f;
    Label c1, c2, c3, c4, c5, c6, cmff, cm7f, cm3f;


    const int rngseed1tmp = 0x2, rngseed2tmp = 0x5, rngseed3tmp = 0x8, rngseed4tmp = 0xb, rngseed5tmp = 0xe,
        rngtmp = 0x10, rngtmp2 = 0x11, rngtmp3 = 0x12, rngtmp4 = 0x14;


    compile_point = 0x801;

    rngseed1.here();
    comp_byte(0x0b);
    comp_byte(0xb0);
    comp_byte(0xef);
    rngseed2.here();
    comp_byte(0xbe);
    comp_byte(0xad);
    comp_byte(0xde);
    rngstate1.here();
    compile_point += MLEN1;
    rngstate2.here();
    compile_point += MLEN2;
    rngcache.here();
    compile_point += 1;
    rngcarry1.here();
    compile_point += 1;
    rngcarry2.here();
    compile_point += 1;
    rngi1.here();
    compile_point += 1;
    rngs1.here();
    compile_point += 1;
    rngi2.here();
    compile_point += 1;
    rngs2.here();
    compile_point += 1;
    rngi2.here();
    compile_point += 1;

    //randomize the seed input bits a little so that
//inputs like 1 or 2 won't be kind of poor
conditionshiftseedgen.here();
    lda_imm(0xb5);
    eor_zp(rngseed1tmp);
    sta_zp(rngseed1tmp);
    lda_imm(0xb0);
    eor_zp(rngseed1tmp + 1);
    sta_zp(rngseed1tmp + 1);
    lda_imm(0xcc);
    eor_zp(rngseed1tmp + 2);
    sta_zp(rngseed1tmp + 2);
    ora_zp(rngseed1tmp + 1);
    ora_zp(rngseed1tmp);    //if seed is 0 then load something else
    bne(t1);
    lda_imm(0xf3);
    sta_zp(rngseed1tmp);
    sta_zp(rngseed1tmp + 1);
    sta_zp(rngseed1tmp + 2);
t1.here();
    lda_imm(0xc7);
    eor_zp(rngseed2tmp);
    sta_zp(rngseed2tmp);
    lda_imm(0x49);
    eor_zp(rngseed2tmp + 1);
    sta_zp(rngseed2tmp + 1);
    lda_imm(0xd2);
    eor_zp(rngseed2tmp + 2);
    sta_zp(rngseed2tmp + 2);
    ora_zp(rngseed2tmp + 1);
    ora_zp(rngseed2tmp);
    bne(t2);
    lda_imm(0x5a);
    sta_zp(rngseed2tmp);
    sta_zp(rngseed2tmp + 1);
    sta_zp(rngseed2tmp + 2);
t2.here();
    rts();

    //input address in x()
    //number of bits to shift in y
shft3left.here();
t3.here();
    asl_zpx(0);
    rol_zpx(1);
    rol_zpx(2);
    dey();
    bne(t3);
    rts();
    //input address in (x)
    //number of bits to shift in y
shft3right.here();
t4.here();
    lsr_zpx(2);
    ror_zpx(1);
    ror_zpx(0);
    dey();
    bne(t4);
    rts();

    //input address in (x)
    //number of bits to shift in y
rot3right.here();
    lda_zpx(0);
    lsr();
    ror_zpx(2);
    ror_zpx(1);
    ror_zpx(0);
    dey();
    bne(rot3right);
    rts();


    //do one tap
    //num bits to shift in y
    //input/output (x)
    //uses_zp(rngseed3tmp
eor3shftright.here();
    lda_zpx(0);
    sta_zp(rngseed3tmp);
    lda_zpx(1);
    sta_zp(rngseed3tmp + 1);
    lda_zpx(2);
    sta_zp(rngseed3tmp + 2);
    jsr(shft3right);
    lda_zpx(0);
    eor_zp(rngseed3tmp);
    sta_zpx(0);
    lda_zpx(1);
    eor_zp(rngseed3tmp + 1);
    sta_zpx(1);
    lda_zpx(2);
    eor_zp(rngseed3tmp + 2);
    sta_zpx(2);
    rts();

    //do one tap
    //num bits to shift in x
    //input/output (shiftparam)
    //uses_zp(rngseed3tmp
eor3shftleft.here();
    lda_zpx(0);
    sta_zp(rngseed3tmp);
    lda_zpx(1);
    sta_zp(rngseed3tmp + 1);
    lda_zpx(2);
    sta_zp(rngseed3tmp + 2);
    jsr(shft3left);
    lda_zpx(0);
    eor_zp(rngseed3tmp);
    sta_zpx(0);
    lda_zpx(1);
    eor_zp(rngseed3tmp + 1);
    sta_zpx(1);
    lda_zpx(2);
    eor_zp(rngseed3tmp + 2);
    sta_zpx(2);
    rts();

    //generate next initialization byte
    //state in_zp(rngseed1tmp,rngseed2tmp
    //output in a
    //uses_zp(rngseed1tmp-rngseed5tmp
    //does not touch x,y
seedshiftreggen.here();
    phx();
    phy();

    ldx_imm(rngseed1tmp);

    ldy_imm(3);
    jsr(eor3shftleft);
    ldy_imm(15);
    jsr(eor3shftright);
    ldy_imm(11);
    jsr(eor3shftleft);

    lda_zp(rngseed1tmp);
    sta_zp(rngseed4tmp);
    lda_zp(rngseed1tmp + 1);
    sta_zp(rngseed4tmp + 1);
    lda_zp(rngseed1tmp + 2);
    sta_zp(rngseed4tmp + 2);

    ldx_imm(rngseed4tmp);
    ldy_imm(6);
    jsr(eor3shftleft);
    ldy_imm(1);
    jsr(eor3shftright);
    ldy_imm(9);
    jsr(eor3shftleft);

    ldx_imm(rngseed2tmp);

    ldy_imm(13);
    jsr(eor3shftleft);
    ldy_imm(5);
    jsr(eor3shftright);
    ldy_imm(8);
    jsr(eor3shftleft);

    lda_zp(rngseed2tmp);
    sta_zp(rngseed5tmp);
    lda_zp(rngseed2tmp + 1);
    sta_zp(rngseed5tmp + 1);
    lda_zp(rngseed2tmp + 2);
    sta_zp(rngseed5tmp + 2);

    ldx_imm(rngseed5tmp);
    ldy_imm(11);
    jsr(eor3shftleft);
    ldy_imm(1);
    jsr(eor3shftright);
    ldy_imm(8);
    jsr(eor3shftleft);

    clc();
    lda_zp(rngseed5tmp);
    adc_zp(rngseed4tmp);
    sta_zp(rngseed5tmp);
    lda_zp(rngseed5tmp + 1);
    adc_zp(rngseed4tmp + 1);
    sta_zp(rngseed5tmp + 1);
    lda_zp(rngseed5tmp + 2);
    adc_zp(rngseed4tmp + 2);
    sta_zp(rngseed5tmp + 2);

    ldy_imm(4);
    jsr(shft3right);
    lda_zp(rngseed5tmp + 1);

    ply();
    plx();
    rts();


    // initialize seed 
    // input: rngseed1,rngseed2
    // output state in:
    //     rngstate1-rngstate3, rngcarry1-rngcarry3,rngi1-rngi3, rngs1-rngs3, rngcache
    // uses all registers
seedrandom.here();
    jsr(conditionshiftseedgen);
    ldx_imm(MLEN1);
    ldy_imm(0);
t5.here();
    jsr(seedshiftreggen);
    sta_aby(rngstate1);
    iny();
    dex();
    bne(t5);
    ldx_imm(MLEN2);
    ldy_imm(0);
t6.here();
    jsr(seedshiftreggen);
    sta_aby(rngstate2);
    iny();
    dex();
    bne(t6);
    stz_ab(rngi1);
    stz_ab(rngi2);
    stz_ab(rngcache);
    //initializing the carries isn't necessary but except for testing
    lda_imm(0xff);
    sta_ab(rngcarry2);
    stz_ab(rngcarry1);

    lda_imm(MLAG1);
    sta_ab(rngs1);
    lda_imm(MLAG2);
    sta_ab(rngs2);

    //x is still 0, do 256 warm up
    ldx_imm(0);
t7.here();
    jsr(rndbyte);
    dex();
    bne(t7);

    rts();
    //the important random function
    //generate one random byte 
    //output in a
    //changes state in 
    //     rngstate1-rngstate2, rngcarry1-rngcarry2,rngi1-rngi2, rngs1-rngs2, rngcache
    //  does not effect x and y
rndbyte.here();

    phx();
    dec_ab(rngi1);
    bmi(a2);
    dec_ab(rngs1);
    bmi(a3);
a1.here();
    lsr_ab(rngcarry1);
    ldx_ab(rngs1);
    lda_abx(rngstate1);
    ldx_ab(rngi1);
    adc_abx(rngstate1);
    sta_abx(rngstate1);
    rol_ab(rngcarry1); //save the carry
    eor_ab(rngcache);

    plx();
    rts();
a2.here();
    lda_imm(MLEN1 - 1);
    sta_ab(rngi1);
    dec_ab(rngs1);
    bpl(a1);
a3.here();
    lda_imm(MLEN1 - 1);
    sta_ab(rngs1);
    dec_ab(rngi2);
    bpl(a4);
    lda_imm(MLEN2 - 1);
    sta_ab(rngi2);
a4.here();
    dec_ab(rngs2);
    bpl(a5);
    lda_imm(MLEN2 - 1);
    sta_ab(rngs2);
a5.here();
    lsr_ab(rngcarry2);
    ldx_ab(rngi2);
    lda_abx(rngstate2);
    ldx_ab(rngs2);
    sbc_abx(rngstate2);
    ldx_ab(rngi2);
    sta_abx(rngstate2);
    rol_ab(rngcarry2); //save the carry
    sta_ab(rngcache);
    bra(a1);
 
Last edited:
  • Like
Reactions: BruceMcF

BruceMcF

Active Member
May 19, 2019
150
49
28
It just occurred to me that a BIT can give direct sequential testing of the high four bits, if it's known that there is a bit set among the high four bits.

It can give five bytes with a shift (to use the carry flag), but the extra codespace overhead is seven bytes, so faster to make the "small n" table 16 bytes long rather than eight bytes long at a net cost of only one byte:

Code:
    STA temp
    AND #$0F
    CMP temp
    BNE flagged
    TAY
    LDA wmtable,Y
    RTS
flagged:
    LDA #%00100000
    BIT temp
    BMI +++
    BVS ++
    BNE +
    LDA #%00011111
    RTS
+  LDA #%00111111
    RTS
++ LDA #%01111111
    RTS
+++
    LDA #%11111111
    RTS
wmtable:
    !byte $00,$01,$03,$03,$07,$07,$07,$07
    !byte $0F,$0F,$0F,$0F,$0F,$0F,$0F,$0F
 
Last edited:

JoshuaScholar

New Member
Nov 6, 2019
24
10
3
I didn't know you could get 4 bits with one BIT, the code in what I wrote is similar but I got 3 and used a 31 byte table.
My whole routine formatted to be similar to yours mine looks like this. So I could reorder it to get rid of 16 bytes of lookup table hmm
A quick calculation is that if I coded it like yours I'd save 16 bytes on table but spend 11 bytes more on code, so I'd only save 5 bytes. It would save 1 cycle on numbers under 16 but take 7 cycles more for numbers over 15 (8 more for numbers over 31). I'm not sure whether I'll make that change.
Code:
    TAY
    BEQ b1 ;if the range is 0 to 0 then just return 0
    STA rngtmp
    INC A
    BEQ rndbyte   ;if the range is up to 255, no need for testing, just call rndbyte
    STA rngtmp2 ;number to compare against after rndbyte is incremented because there's no branch less than
    LDA #$20
    BIT rngtmp  ;bit is clever because it tests three things, top bit, second bit and third bit at once
    BMI mff   ;high bit, mask is ff
    BVS m7f  ;second bit mask is 7f
    BNE m3f  ;third bit mask is 3f
    LDA rnd_mask_table,Y
b4:
    STA rngtmp
b3:
    JSR rndbyte
    AND rngtmp
    CMP rngtmp2
    BCS b3
    RTS
b1:
    LDA #0
    RTS
mff:
    LDA #$ff
    BRA b4
m7f:
    LDA #$7f
    BRA b4
m3f:
    LDA #$3f
    BRA b4
rnd_mask_table:
    !byte $00,$01,$03,$03,$07,$07,$07,$07
    !byte $0F,$0F,$0F,$0F,$0F,$0F,$0F,$0F
    !byte $1F,$1F,$1F,$1F,$1F,$1F,$1F,$1F
    !byte $1F,$1F,$1F,$1F,$1F,$1F,$1F,$1F
 
Last edited:

BruceMcF

Active Member
May 19, 2019
150
49
28
I'm still too nmos 6502 oriented, it's faster & shorter in the first part with the 65c02 BIT# and two clocks faster and the same size in the second part (which would otherwise be "STA temp: LDA #%00100000: BIT temp").

It would be even faster if BIT # tested bits 7 and 6 of the accumulator in the sign and overflow flags, like the other bits test bits 7 and 6 of the data byte, but you can't have everything ...
... switching from the 16byte table version below to a 32 byte table version of the same process costs net 25 bytes, saves 4 clocks in the n=32-63 case and I think about 8 clocks in the n=16-31 case:

Code:
    BIT #%11110000 ; #$11100000 for 32 byte table
    BNE flagged
    TAY
    LDA wmtable,Y
    RTS
flagged:
    BIT #%10000000
    BNE +++
    BIT #%01000000
    BNE ++
    BIT #%00100000 ; omit for 32 byte table
    BNE +         ; omit for 32 byte table
    LDA #%00011111 ; omit for 32 byte table
    RTS           ; omit for 32 byte table
+  LDA #%00111111
    RTS
++ LDA #%01111111
    RTS
+++
    LDA #%11111111
    RTS
wmtable:
    !byte $00,$01,$03,$03,$07,$07,$07,$07
    !byte $0F,$0F,$0F,$0F,$0F,$0F,$0F,$0F
; 32 byte table
;    !byte $1F,$1F,$1F,$1F,$1F,$1F,$1F,$1F
;    !byte $1F,$1F,$1F,$1F,$1F,$1F,$1F,$1F
 
Last edited:

JoshuaScholar

New Member
Nov 6, 2019
24
10
3
As I said on the facebook post https://www.facebook.com/groups/CommanderX16/permalink/402529400498160/

I think I found some improvements... pending. I had originally found some references to add with carry and subtract with borrow versions of Lagged Fibonacci generators being better (and obviously they'd fix the low bit problem) so I used that.. But the assumption that the length and lag should be the same as one without carry was wrong. There's only one paper on this.

I think I'll be able to make the state shorter and maybe even simplify it to one generator, testing that now.

Lol you look at the paper and I have to find r and s such that 256^r+256^s-1 is prime (read as "256 to the power of r etc."), and pick them such that the multiplicative order of 256 with that prime is low. You'd think that's hard, but tools are so good now, if you have a copy of Maxima you can write a short program that will give you that. I'm testing something that won't take as much memory.

To make you feel better about using these tools to engineer your retro system, actually Maxima was originally Macsyma written in the late 1960's at MIT. It's really a very old computer algebra system. It's just a computer that runs it no longer costs millions of dollars. I read about it when I was in high school in the 80's and really wanted to be able to play with it. Now I can!

By the way, here's the paper https://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoap/1177005878
 
Last edited:
  • Like
Reactions: BruceMcF

JoshuaScholar

New Member
Nov 6, 2019
24
10
3
Whether this is a result of following the mathematical advice for the parameters which give a maximally long cycle or just from trying things out, I did manage to cut the memory needed by the state in half and I ran Big Crush 11 times on different seeds and it passed every test. Then I remembered to try this and I'm running a few tests bit and byte reversed too.

Changed both generators to add with carry (saves instructions and matters for settings) and set the length/lag to 44/14 and 24/23
It was originally 71/65 and 55/24 so this saves 59 bytes state plus 4 bytes of instructions
 

LRFLEW

New Member
Sep 23, 2019
19
4
3
I've been away from the forum for a bit (been busy), but saw this thread and felt like I should chime in.

I had found a fairly efficient LCG in the cc65 source that I found really clever. Put simply, it used the fact multiplying a 32-bit value by $01010101 can be done by a series of byte-wise additions. I realized that if I changed the addition constant to a repeating byte pattern, I could merge the addition and multiplication steps together, and perform the LCG operation with only 4 add instructions.

You can retrieve a 16-bit random value in X:A using this method as follows:

Code:
rand:   clc
        lda     rand+0
        adc     #$B3
        sta     rand+0
        adc     rand+1
        sta     rand+1
        adc     rand+2
        sta     rand+2
        tax
        adc     rand+3
        sta     rand+3
        rts
One advantage to this method is that you can trivially increase the size of the state by adding more instructions of the form adc rand+N ; sta rand+N.

The code in cc65
The pull request I made for the current implementation

EDIT: Spent some time reading the rest of this thread. It looks like the work being done on the Fibonacci generator is useful for generating higher-quality random numbers. This method has a much shorter cycle length, and is much more predictable. However, I doubt you can find a faster construction for an RNG. I'm using this algorithm for a program (that I should really get around to releasing) that requires filling the screen with random noise repeatedly, so I'm using this algorithm for its speed.
 
Last edited:

JoshuaScholar

New Member
Nov 6, 2019
24
10
3
I just tried a version of your rng...
It failed smallcrush pretty badly.
I do know that someone had a 64 bit LCG (based on 128 bit multiplication and truncation) that passes bigcrush.

You know mine isn't that slow, if you put 3 bytes of it on zero page then it's 69 cycles average.
 
Last edited:

LRFLEW

New Member
Sep 23, 2019
19
4
3
I expected it to fail Crush (and variants). LCGs are not great PRNGs to begin with, it's a power-of-two modulus, and the constants don't produce the best values. My guess is that the LCG that passed involved values for m, c, and a that would be impractical for the 6502. This PRNG is fast, however, so it's useful for instances that require quick-and-dirty entropy, such as "AI" behavior for games.
 

JoshuaScholar

New Member
Nov 6, 2019
24
10
3
I also wrote routines that find a uniform random int from 0-n both for n is a byte and n is a 2-byte-word.