j1a8k doesn't use all available ram #42

Open
bmentink opened this issue Oct 3, 2016 · 11 comments

Comments

@bmentink

bmentink commented Oct 3, 2016

Looks like ram is still 8K, I believe the ICE40HX8k fpga has 32K?

@RGD2
Contributor

RGD2 commented Oct 4, 2016

There are unused ram blocks, yes.
I was thinking of using some to implement the deeper parts of the stacks
for the j4a, to try to free up some of the LUTs. (The big pipelined
quad-stack structure in the j4a seems to be eating the most area).

I've also been thinking on and off of a 24 bit Jx variant to
directly allow for an even 20 bits of ram address: using 24x 1M (or 16x1M +
8x1M, or 3x8x1M) in external sram chips. Unfortunately, it quickly gets
expensive. And although I like the concept - I've never actually run out of
ram with any practical applications of the j4a. Yet, at least.

So yes, there is ram free - afaicr about half the ram in the 8k is unused -
but using it as directly accessible ram requires changing the instruction
format, or increasing the data word width.

I'd prefer to leave it free to implement some application-specific hardware
fifos for exchanging data between threads or between peripherals. (Or
perhaps crossing clock domains).

If I had SDRAM on board, I'd just use it as a very deep FIFO on the way to
a USB 2.0 Hi-Speed Fifo interface chip. (They work much better with at
least 8 MiB or so of fifo, rather than the 2-4kiB you usually get).

But doing that also requires srams available to muster data for bursts
into/out of the SDRAM chip, or for covering its refresh cycle
unavailability.

I have used that sort of setup to maintain continuous 30MB/s captures from
banks of ADC for up to about 8 hrs at a time. (It seemed to work ok for
about 99 hours during testing. But I only had 11 TiB of disc space...)
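As a rough check of those figures, here is a minimal sketch of the arithmetic. It assumes "MB" means 10^6 bytes and a perfectly steady capture rate; the function name is only illustrative.

```python
def capture_bytes(rate_mb_per_s, hours):
    """Bytes written by a continuous capture at a steady rate (1 MB = 10**6 bytes assumed)."""
    return rate_mb_per_s * 10**6 * hours * 3600


def to_tib(n_bytes):
    """Convert a byte count to binary tebibytes (2**40 bytes)."""
    return n_bytes / 2**40


eight_hour_run = capture_bytes(30, 8)    # a typical run at 30 MB/s
long_test_run = capture_bytes(30, 99)    # the roughly 99 hour test

print(f"8 h capture:  {eight_hour_run / 10**9:.0f} GB")
print(f"99 h capture: {to_tib(long_test_run):.2f} TiB")
```

Under those assumptions the 99 hour run comes out a little under 10 TiB, which fits within the 11 TiB of disc mentioned above.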

I'm really not at all keen on using SDRAM for system RAM. Too much latency,
even with cache. It makes the performance of the whole thing inconsistent.
J4a is really useful to me precisely because it is consistent. If I need
much ram or CPU power, I'll just plug in an SBC or PC, and transfer the
data there for processing.

That the j4a only half-fills an 8k chip gives it a lot of flexibility. I
like to deploy it with a Linux SBC handy to keep the toolchain accessible.
(And to give it easy network-accessibility).
This means I'll tend to write verilog peripherals to plug in to suit the
specific application. At the moment I have one with two
differently-configured SPI hardware peripherals. The faster one runs at 20
MHz, and either can be used in word or byte mode. (The slower 10MHz one
does byte swapping because it's used to talk to a CAN bus controller, and
CAN bus data is little-endian). I intend to add the peripherals as open
source, but this is kinda where keeping git branches (and sub-branches) of
swapforth starts making sense.
Although it is reasonable I think to push the peripheral module files
upstream - just leave them unconnected, except in the downstream branch
they originate in.

@bmentink
Author

bmentink commented Oct 4, 2016

@RGD2

Thanks for the explanation. I agree SDRAM is probably not ideal. I would prefer more FPGA RAM being freed up, even if it is addressed separately.

My current requirements are to implement a fast DDS circuit in verilog (interfaced to swapforth), so I need a 512x10bit lookup table for arbitrary waveforms. I don't want to use up any of the remaining 3k of ram for that if possible, as I want to write substantial Forth programs too ..

By The way, can you explain how to use the current ram for that? I don't understand the ram.v module at all, or how the python program generates it from the .hex file ...

Also, is there any documentation for the j4a cpu? I am guessing that it has 4 x simultaneous hardware threads? I can't find any description of it .. may be useful to me as well.

Having open source shareable verilog modules is a great idea. I have some PWM modules with pre-scalers etc I could share as well.

@RGD2
Contributor

RGD2 commented Oct 5, 2016

The j4a only has four sets of stacks - they round-robin through the ram and
alu, which prevents concurrency problems.

The arrangement means the alu could be pipelined, and that's where the
speed up would occur. But it would likely never run one thread any faster
than a j1a, because of the critical delay path being ultimately just as
long.

At the moment, each thread runs 1/4 as fast as a j1a, but pipelining should
allow increasing the clock rate to 160 MHz, which would make each thread
exactly as fast as a j1a. The ram is capable of much higher rates, so it's
not the bottleneck, and the stacks can also be pipelined, so they shouldn't
be either. Ditto for the io interface, which if pipelined could accommodate
a proper address decoder to allow much increased IO space for hanging
peripherals off of, at the expense of a little more latency.
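The round-robin arithmetic above can be sketched in a few lines. This is only a model of the scheduling described in this comment, not of the actual j4a verilog; the function names are illustrative, and the 40 MHz figure is simply what 160 MHz divided by four threads implies.

```python
def round_robin_schedule(threads=4, cycles=8):
    """Return which thread owns the ram/alu on each clock cycle."""
    return [cycle % threads for cycle in range(cycles)]


def per_thread_rate_mhz(core_clock_mhz, threads=4):
    """Each thread gets one slot in every `threads` cycles."""
    return core_clock_mhz / threads


print(round_robin_schedule())        # thread index per cycle
print(per_thread_rate_mhz(160))      # effective rate of one thread at 160 MHz
```

Since only one thread touches the ram and alu in any given cycle, no two threads can collide on them, which is the concurrency property mentioned above.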

The j4a is compatible with the j1a, in the sense that one can use the j1a
simulator (make bootstrap) to compile swapforth for the j4a.

@bmentink
Author

bmentink commented Oct 5, 2016

Got it, thanks. Having 4 threads run as fast as the j1a would be awesome!
Having the clock at 160 MHz would help me too, as I would like to clock other FPGA peripherals at greater than 100 MHz ..

Did you have any thoughts about my waveform table in FPGA RAM?

@RGD2
Contributor

RGD2 commented Oct 5, 2016

I'm currently doing something similar - except using the j4a to spit out
samples to a SPI connected DAC, but only at 2 kHz.
(Could have gone much faster of course, but that was the spec I was given.)

I just copy-paste the values from a spreadsheet, where the column has been
formatted as "$xxxx , " (without the quotes, but note the spaces). Then
drop them into a text file I can #include.
The comma is a forth word which compiles what was on the stack to the end
of the dictionary. Use the forth word create with a name first to put a
word of that name down which gives you the address of the start of the
array. Then immediately do the #include. I have it scripted: you can put
#include in files you #include, and it works as you expect. (This is how I
do a "clean" build of an app: usually using make sim_connect to run the top
level include so I can avoid having to build the FPGA more than once, or
having to simulate the full j4a).

I also keep track of the number of values with a constant, and the code
that accesses the data just does pointer arithmetic then a @.
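If a spreadsheet isn't handy, the same include file could be produced with a short script. The following is a minimal sketch, assuming a 512-entry, 10-bit unsigned sine table (the size mentioned earlier in the thread); the word names `wavetable` and `#wavetable` and the file name are hypothetical. It emits a `create` line, the values in the "$xxxx , " form described above, and a constant holding the entry count.

```python
import math


def sine_table(entries=512, bits=10):
    """One sine period as unsigned integers spanning 0 .. 2**bits - 1."""
    full_scale = (1 << bits) - 1
    return [round((math.sin(2 * math.pi * i / entries) + 1) / 2 * full_scale)
            for i in range(entries)]


def forth_include(name, values, per_line=8):
    """Build Forth source text: `create name`, then each value followed by a comma word."""
    lines = [f"create {name}"]
    for start in range(0, len(values), per_line):
        chunk = values[start:start + per_line]
        # Each entry is "$xxxx ," with spaces around the comma word.
        lines.append(" ".join(f"${v:04X} ," for v in chunk))
    # Hypothetical naming convention for the entry count.
    lines.append(f"${len(values):04X} constant #{name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("wavetable.fs", "w") as f:   # hypothetical file name
        f.write(forth_include("wavetable", sine_table()))
```

The resulting file can then be pulled in with #include like any other source file.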

There are other, more advanced ways to do this with forth.
Don't forget you could load/reload at runtime as well : one idea is to
include a raspi to run the UI as a web app, and have it load snippets of
forth at run time.
(Thus avoiding needing everything to fit in j1 ram at all times.)
YMMV though.
Don't forget the icoboard exists, and has SRAM on board.... It would
happily sit on a Pi.

-- Remy

@bmentink
Author

bmentink commented Oct 5, 2016

Hi,

Yes, I understand doing it in Forth. But I want to spit out samples at 48 MHz,
so I have to use verilog ..

I would like to know how to format the data to include in the ram arrays
the same way the forth code has been included in the binary image, but I
don't understand how mkrom.py creates the data that is included in
ram.v. (I'm not sure how the data is split among the address cells) ... I'm
not too familiar with python.

Any help there would be great ..

@RGD2
Contributor

RGD2 commented Oct 5, 2016

Hmm, might have to ask James, I don't really understand it either.
But it's a bit of a kludge: it's possible now to change block SRAM in a
FPGA bin file directly with icebram - it wasn't around when mkrom.py was
written, and it's much quicker to change the bin file than it is to
recompile the whole thing.
So, I'd suggest looking into that first.

@bmentink
Author

bmentink commented Oct 5, 2016

Thanks, will look into icebram. I also thought of creating the data as 16bit hex words, swapping the bytes, and adding/replacing the last block of values in nuc.hex (top of ram) with the contents. My Verilog module would then spit out the block to a 16bit DAC, I have some Verilog DDS code to do all that ..
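The byte-swapping step on its own might look like the sketch below. It only shows swapping the two bytes of each 16-bit word and printing them as hex; how nuc.hex is actually laid out is not described in this thread, so that part is deliberately not modelled, and the function names are illustrative.

```python
def swap16(word):
    """Swap the high and low bytes of a 16-bit word."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    return ((word & 0xFF) << 8) | (word >> 8)


def swapped_hex_words(samples):
    """Format byte-swapped samples as 4-digit hex strings, one per word."""
    return [f"{swap16(s):04x}" for s in samples]


if __name__ == "__main__":
    for line in swapped_hex_words([0x0000, 0x01FF, 0x03FF]):
        print(line)
```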

@bmentink
Author

Any further ideas on how I can implement the sine table I need in the FPGA? I have no idea how to address it from verilog, even if I put it at the top of the current RAM.

@RGD2
Contributor

RGD2 commented Nov 25, 2016

Write a little verilog machine to generate it as you like, without swapforth at all, starting from one of the example designs.

Then, when that works as expected, modify j1a.v to include the thing and include controls to allow your j1a instance to control it via io! and io@ .

The block rams you'll add will be completely separate from the j1a's ram. See lattice's documentation for how the different EBR primitive blocks work in verilog.
There's a primitive for 256 words of 16 bits; if you want more than that, you'll have to combine multiple blocks together, e.g. two 512x8's which each get the same address inputs, and whose outputs are concatenated to give you your 16-bit data.

Obviously, combined this way, you'll have to 'deinterlace' your sine table into two 512-entry 8-bit tables, and then put them in the relevant spots. You could just use some dummy values in the verilog init streams, and then use icebram to extract all the BRAMs from the FPGA .bin file, so you can figure out which block is which, then use python to put the proper wavetables back into a format that icebram will accept, to overwrite what's in the .bin FPGA bitfile.
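The 'deinterlace' step could be sketched in python along these lines. It splits each 16-bit entry into a low-byte table and a high-byte table, and models the read side where both blocks see the same address and their outputs are concatenated. Which physical block ends up holding which half, and the file format icebram expects, are not covered here; the function names are illustrative.

```python
def split_bytes(table16):
    """Split 16-bit entries into two 8-bit tables (low bytes, high bytes)."""
    low = [word & 0xFF for word in table16]
    high = [(word >> 8) & 0xFF for word in table16]
    return low, high


def read_combined(low, high, address):
    """Model two 512x8 blocks sharing one address, outputs concatenated."""
    return (high[address] << 8) | low[address]


if __name__ == "__main__":
    table = [(i * 37) & 0x03FF for i in range(512)]   # dummy 10-bit data
    low, high = split_bytes(table)
    # Reading back through the combined blocks should reproduce the table.
    assert all(read_combined(low, high, a) == table[a] for a in range(512))
```

With a 10-bit table, only the bottom two bits of each high byte are ever non-zero.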

@Mecrisp

Mecrisp commented Jan 18, 2017

I found a way to address the whole RAM available in the HX8K: you can move the "fetch" bit out of the address space directly accessible with call/jmp/jnz, and do a RAM fetch explicitly by or'ing/add'ing the high fetch bit and passing the result to execute. OK, it will make the sequence "variable @" longer (not just a single high-call opcode, but a literal and a "call execute"); however, this is compensated for by double the amount of RAM available. Regarding your other questions, I wired in the 16x16=32 multiplier as two different opcodes for the low and high 16-bit parts of the result. I also replaced do-loop with a stack-only version, as James's original variant involving a local variable was not interrupt safe. You can see my modifications to Swapforth in the current Mecrisp-Ice package on mecrisp.sourceforge.net
Best wishes, Matthias
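
For readers unfamiliar with splitting a multiplier result across two opcodes, here is a small python model of the idea. It assumes unsigned 16-bit operands and is not taken from the Mecrisp-Ice sources; the function names are illustrative.

```python
MASK16 = 0xFFFF


def mul_low(a, b):
    """Low 16 bits of the 32-bit product of two unsigned 16-bit values."""
    return ((a & MASK16) * (b & MASK16)) & MASK16


def mul_high(a, b):
    """High 16 bits of the 32-bit product of two unsigned 16-bit values."""
    return (((a & MASK16) * (b & MASK16)) >> 16) & MASK16


if __name__ == "__main__":
    a, b = 0x1234, 0x5678
    # The two halves together give back the full 32-bit product.
    assert (mul_high(a, b) << 16) | mul_low(a, b) == a * b
```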
