ruud.baltissen_at_abp.nl
Date: 2002-04-17 13:37:09
Hello everyone,
I am simply forwarding some other emails regarding Gideon's FPGA
implementation of the 6510. At the end you will find a short history and
some more ins and outs.
==============================================================================
|How about the illegal instructions? As I recall from previous postings, at
|that time at least there were no plans to implement them.
In this version, almost all 6502 illegal opcodes have been implemented, too.
Most of them result from 'poorly' decoding the instruction, such as LAX
and SAX. The opcodes ending in $3, $7 and $F (and some in $B) also act the
same as on the 6502, because they merely cause the internal states to be
taken in a 'wrong' order. Some other "illegal" opcodes, like the slots where
STX $nnnn,Y and STY $nnnn,X should have been, were called illegal because
they didn't work in the original 6502. In this FPGA implementation they do
work, so in those opcode slots you'll find STX $nnnn,Y and STY $nnnn,X, and
of course the load variants as well. IMHO it doesn't matter that these
opcodes act a bit differently from the original 6502, since these were not
stable.
Anyone who is interested in testing all opcodes: you're welcome. I just
don't know how to get an FPGA board to you to test them. Maybe I will have a
few of them made.
|Does this implementation also enable weird stuff like putting bytes on the
|bus at certain times to write to areas normally inaccessible? I mean writes
|to RAM $00/$01 which is more stable on the later C64s.
I don't have any demos, so if you'd like to have the result of some tests,
then please send me the 5.25" disk with the demo :)
As far as locations $0 and $1 are concerned: in my implementation the reads
always come from the local PIO registers, and the writes go to the PIO
registers but also to the bus, so the rest of the system *does* write those
bytes into RAM. I am not sure if this is the case with the original 6510.
Anyway - reading the RAM locations $0 and $1 by using sprite collisions etc.
doesn't have anything to do with the CPU, since you are reading them through
the VIC, so that should work.
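
The $0/$1 behaviour described above can be sketched as a toy memory model.
This is only an illustration in Python, not Gideon's design: the class and
method names are made up, and the port is simplified so that a read simply
returns the last value written to the register (a real 6510 port also
depends on the direction bits and the pin levels).

```python
class Port6510Sketch:
    """Toy model of the $00/$01 handling (all names are hypothetical)."""

    def __init__(self):
        self.ram = bytearray(65536)  # system RAM as seen on the external bus
        self.pio = bytearray(2)      # local registers behind $00 and $01

    def cpu_write(self, addr, value):
        value &= 0xFF
        if addr in (0, 1):
            self.pio[addr] = value   # update the local PIO register
        self.ram[addr] = value       # the write also goes out on the bus

    def cpu_read(self, addr):
        if addr in (0, 1):
            return self.pio[addr]    # CPU reads never see the RAM under $00/$01
        return self.ram[addr]

    def vic_read(self, addr):
        return self.ram[addr]        # the VIC bypasses the CPU and sees RAM only


machine = Port6510Sketch()
machine.cpu_write(0x0001, 0x37)
print(hex(machine.cpu_read(0x0001)), hex(machine.vic_read(0x0001)))
```

In this model a CPU write to $01 lands in both the register and the RAM
underneath it, while only a VIC-side read can get at that RAM byte.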
As far as some illegal opcodes are concerned: in my last post I wrote that
the 'unstable' opcodes of the original 6502 do not work the same on my 6510
implementation. This is true, but since Nathan pointed out that only 2 of
them were not stable, I have to broaden this definition a bit. In this
6510, the ones that had a very unusual and hard to comprehend meaning (like
the high address byte + 1 ANDed with some other value, blah blah), *those*
will all work differently. Opcodes $x3, $x7, $xF will do the same as on the
original chip; guaranteed! So will the opcodes that select A and X together:
LAX and SAX.
Some other opcodes that did nothing but a "read from the bus" in the
original 6502 now do something. Examples:
5C: JMP $nnnn,X
34: BIT $nn,X
3C: BIT $nnnn,X
04, 14, 0C, 1C: Similar to BIT, but then with OR instead of AND
These came for free by 'loosening' the decoding a little.
As far as the timing is concerned, there are some differences. Off the top
of my head:
* branches take 2 cycles untaken, 4 taken, no matter if the page boundary is
crossed or not.
* implied instructions always take 1 cycle instead of 2 (TAX, CLI, etc)
* RTS and RTI take one cycle more
* Additions/subtractions in decimal mode are less buggy and take one
clock cycle more.
* In read/modify/write instructions, the wrong value is not written first,
as was the case on the 6502.
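
To get a feeling for what these differences mean for a timed loop, here is
a small Python sketch. The 6510 figures are the documented ones (branches
2/3/4 cycles, implied instructions 2, RTS and RTI 6); the "fpga" figures are
derived only from the list above, and it is an assumption that LDX #imm
still takes 2 cycles on the FPGA core. Function and table names are made up.

```python
def branch_cycles(taken, page_crossed, core="6510"):
    """Cycles for a relative branch on the original chip or the FPGA core."""
    if not taken:
        return 2
    if core == "fpga":
        return 4                      # always 4 when taken, per the list above
    return 4 if page_crossed else 3   # original: +1 when a page is crossed


# Cycle counts for a few fixed-length instructions.
CYCLES = {
    "TAX": {"6510": 2, "fpga": 1},
    "CLI": {"6510": 2, "fpga": 1},
    "DEX": {"6510": 2, "fpga": 1},
    "RTS": {"6510": 6, "fpga": 7},
    "RTI": {"6510": 6, "fpga": 7},
}


def delay_loop_cycles(count, core="6510"):
    """Cycles for: LDX #count / loop: DEX / BNE loop (no page crossing)."""
    ldx_immediate = 2                 # assumed unchanged on the FPGA core
    dex = CYCLES["DEX"][core] * count
    taken = branch_cycles(True, False, core) * (count - 1)
    not_taken = branch_cycles(False, False, core)
    return ldx_immediate + dex + taken + not_taken


for core in ("6510", "fpga"):
    print(core, delay_loop_cycles(10, core))
```

In a loop like this the faster DEX and the slower taken branch largely
cancel each other out, so the total differs by a single cycle.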
I hope that this gives some more clarity about what the implementation looks
like.
==============================================================================
History:
Gideon contacted me in private because, being a C64 fan and working with
FPGAs, he had the idea of building a C64 in an FPGA. He searched the net,
hit my site so often :) and, being Dutch as well, decided to contact me.
I told him that Jeri was working on the C=1, so in fact he would be
reinventing the wheel. On the other hand, Jeri had to use the 65816 as there
was no free (good) core for the 6502 and 65816.
So Gideon decided to shift his attention to the processor by producing a
better CPU than the 65816. One that can actually replace the original 65816
on the C=1 but also the 6510 or 6502 on other C= computers (just a matter of
another interface). We are aiming at a 32-bit CPU with some extras, running
at 32 MHz. Maybe one that has a 65816 mode. But one that can run 6502 code
any time !!!
About the illegal opcodes:
Gideon and I are still discussing what to do with them. Let's have a look at
the 65816. It has only ONE opcode (WDM / $42) left that can be used for
extending the instruction set. This would mean we would end up with 3-byte
instructions. Using the illegal opcodes means we will still have some 2-byte
instructions rather than only 3-byte ones.
Then the facts:
1) Who is using illegal opcodes? AFAIK mostly demos, with no other reason
than to gain some extra microseconds.
2) Users having a SCPU cannot play these demos anyway, as the 65816 won't
recognise the instructions as meant (and can therefore crash).
About the differences in timing:
Gideon could change the design so the 65GS10 would work exactly like the
original 6510. Adding an extra cycle is no problem, but reducing the extra
ones is. All operations inside the FPGA are done at the rising edge of the
clock. Doing some operations at the falling edge as well would do the trick.
But then you end up with operations that sometimes have to be activated on the
rising edge and other times at the falling edge. And the combination is the
problem as (for the moment) the solution costs too many gates compared to
the gain.
What programs really depend on these timings? Mostly demos and games. As I
said before, we are aiming at a 32 MHz CPU. Running the CPU at any speed
other than the original frequency would screw up such a game/demo anyway.
IMHO those few clock cycles won't make the difference then.
What about the extra speed for games? SCPU users have run into this problem
already, I think. (I don't have one, so I cannot tell.) In fact I think we
will run into the same problem with a lot of games as we had with the PCs
at the end of the 80's: many games only ran fine on PCs equipped with an
8088 running at 4.77 MHz. (This is IMHO the only reason why PCs were
equipped with a "Turbo button".)
I don't see any reason why the 65GS32 could not run at 1 MHz. I wonder what
game would drop dead because some instructions aren't cycle exact.
(Hmmm, a "single stepper" inside a monitor could.)
About the extras:
- The 65GSxx is capable of addressing SDRAMs directly. This feature is
needed so the 65GSxx can run at those high speeds.
OK, this isn't a feature you would expect of a CPU, but it is built inside
the same FPGA and therefore considered part of the CPU. The same goes for
the other extras.
- A Memory Management Unit. People familiar with a SCPU will immediately
know why we need this device. The VIC cannot "see" the SDRAM in any way. So
the 65GSxx MUST write video data to the original RAM of the C64. The MMU
enables us to tell the CPU whether to use the original RAM or the SDRAM.
A special instruction will replace the zero page with a set of registers. A
simple loop like:
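        ldx #0          ; X starts at 0, so all 256 bytes are covered
L1      lda ROM,X       ; fetch a byte from the source
        sta $00,X       ; store it in the matching zero page register
        dex             ; index order: 0, then 255 down to 1
        bne L1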
could fill these registers from ROM or whatever other source.
- The CPU is going to be equipped with 32 (?) 32-bit general purpose
registers. This means we could perform instructions like "LOAD R1, ($12),R3"
but also "LOAD R3, ($12),R1". The idea is to dedicate (part of) the
registers to the well known standard registers of the 6502. So "LDA ($12),Y"
will in fact do the 8-bit version of a "LOAD R1, ($12),R2" like the ones
above.
We also need more instructions. LDA (or LDAB) loads a byte. LDAW will load a
word, LDAD a double-word. LDAx (or LDAx16) uses a 16-bit address, LDAx24 a
24-bit address, LDAx32 all 32 bits.
In this way it is easy to extend the existing instructions. A problem will
be the sheer mass of possible combinations. What about all possibilities
with the instruction "LDA ($xx),Y"? This command alone has 36 possible
combinations !!!
The 16- and 24-bit address instructions are another problem: what about the
unused higher address bits? One idea is to make them zero. Another idea is
to dedicate a register to these instructions to fill in the remaining bits.
In this way we can run several virtual 6502 processes parallel to each
other.
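
Both ideas for the unused address bits can be sketched with one small
helper. This is only an illustration in Python; the function name and the
notion of a single "fill register" value are assumptions, not a decided
part of the 65GSxx design.

```python
def extend_address(addr, width_bits, fill_reg=0):
    """Widen a 16- or 24-bit address to a full 32-bit address.

    With fill_reg left at 0 the unused upper bits become zero (idea one).
    Otherwise the upper bits are taken from fill_reg (idea two), so each
    virtual 6502 process could get its own window in the 32-bit space.
    """
    if width_bits not in (16, 24, 32):
        raise ValueError("width_bits must be 16, 24 or 32")
    low_mask = (1 << width_bits) - 1
    return ((fill_reg & ~low_mask) | (addr & low_mask)) & 0xFFFFFFFF


# The same 16-bit address seen by two hypothetical virtual 6502 processes:
print(hex(extend_address(0x1234, 16, 0x00050000)))
print(hex(extend_address(0x1234, 16, 0x00060000)))
```

Changing only the fill register moves a whole 64 KB (or 16 MB) view without
touching the 6502 code running inside it.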
- New instructions:
This is a matter of gains and costs. Using the ADC instruction the first
time means we need to (re-)set the carry flag. The 80x86 has the ADD
instruction, which does the addition while disregarding the carry. In our
opinion we can do without this instruction as the gain is marginal.
The 6502 has no block instruction. The 65816 has two: MVN and MVP. With X
varying from 8 to 32 bits, this loop:
        ldx VALUE       ; number of elements to move
L1      lda HERE,X      ; copies index VALUE down to index 1
        sta THERE,X
        dex
        bne L1
can replace such a block function. But I figured out that a block function
could move a double-word every 2 cycles against 6 for the above loop. This
is a gain of 200%, but then: what is the overall gain? I could be wrong, but
a compiler does not benefit from this gain; a text editor could.
I can hear some of you think: 6 cycles ???? Yep :)
- Cache with onboard alignment and pipelining:
"LDA ..." and "STA ..." are 6 bytes each: a 2-byte instruction plus 4
address bytes. "DEX" is one byte, "BNE L1" two bytes. Total: 15 bytes = 4
cycles. Add two cycles for the actual read and write and you have 6.
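
The arithmetic behind those 6 cycles, as a small Python sketch. It assumes
that the instruction stream is fetched 4 bytes (32 bits) per cycle and that
fetching is perfectly packed by the alignment logic; the function name is
made up for the illustration.

```python
import math


def loop_cycles(instruction_sizes, bytes_per_fetch=4, data_accesses=2):
    """Fetch cycles for the instruction bytes plus one cycle per data access."""
    fetch_cycles = math.ceil(sum(instruction_sizes) / bytes_per_fetch)
    return fetch_cycles + data_accesses


# LDA (6 bytes), STA (6 bytes), DEX (1 byte), BNE (2 bytes)
sizes = [6, 6, 1, 2]
print(sum(sizes), "bytes ->", loop_cycles(sizes), "cycles per iteration")
```

Against a block function that moves a double-word every 2 cycles, this is
the factor of three mentioned earlier.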
Future:
Gideon's idea is to start with the SDRAM interface and the MMU first.
Without them there is no good way of testing any 32-bit extensions.
___
/ __|__
/ / |_/ Groetjes, Ruud
\ \__|_\
\___| http://Ruud.C64.org
Message was sent through the cbm-hackers mailing list