Continue to Site

Welcome to our site!

Electro Tech is an online community (with over 170,000 members) who enjoy talking about and building electronic circuits, projects and gadgets. To participate you need to register. Registration is free. Click here to register now.

  • Welcome to our site! Electro Tech is an online community (with over 170,000 members) who enjoy talking about and building electronic circuits, projects and gadgets. To participate you need to register. Registration is free. Click here to register now.

Need help - Filter FIR

G3R0

New Member

Objective:​

Design an FIR Filter to compute the output
[] based on the delayed samples (₀, ₁, ₂, ₃) and the coefficients (₀, ₁, ₂, ₃) using the following equation:

Y[n]=S0⋅A0+S1⋅A1+S2⋅A2+S3⋅A3
Y[n] = S₀ ⋅ A₀ + S₁ ⋅ A₁ + S₂ ⋅A₂ + S₃ ⋅ A₃
Y[n]=S0⋅A0+S1⋅A1+S2⋅A2+S3⋅A3
The filter must adhere to the following hardware constraints:


Constraints:​

  1. Single Multiplier:
    All partial products ([] ⋅ []) must be computed sequentially.
  2. Single Adder:
    Partial products must be added to the accumulator in stages, reusing the same adder.
  3. Sequential Processing:
    The computation must be completed in 4 clock cycles, processing one stage ([] ⋅ []) per cycle.
  4. Control Logic:
    Control logic is required to sequentially load the samples [] and coefficients [] into the multiplier. Additionally, it must manage the accumulator and generate the output after the 4 cycles.
  5. Shift Register:
    The delayed samples (₀, ₁, ₂, ₃) must be managed using a Shift Register that automatically updates the values when a new sample [] is input.

System Design:​

  1. Data Input:
    A new sample () is input into the Shift Register, which updates the delayed samples (₀, ₁, ₂, ₃).
  2. Partial Product Calculation:
    The multiplier sequentially takes each delayed sample [] and coefficient [] to compute the partial product.
  3. Accumulation:
    The adder adds the partial product to the accumulator ().
  4. Final Output:
    After 4 cycles, the accumulator contains the output [].


I really need help with this project. I’ve been researching this topic for several days because in my course, we haven’t covered FIR filters, and I’ve been tasked with completing this assignment without having the necessary background knowledge. I’ve started working on a diagram, which they’ve asked to be “superficial,” meaning I don’t need to provide perfect details about the internal workings of each component. I’ll include an image of what I think should make up the system's diagram.

The constraint I have for designing my FIR filter is that I can only use a single multiplier. As it must consist of 4 stages, I need to design it to operate across 4 clock cycles.

This is the 4-stage FIR filter they told me about:
1733570536809.png


And this is what I’ve come up with so far, but I’m not entirely sure about it:

1733570574374.png
 
You are on the right track. FIR very simple architecture :

1733576906675.png


So you have a delay line (memory) where each sample is input (after) shifting
all the prior samples (N samples) and recompute the sum of all the taps. The
b's above are the coefficient values)

Regards, Dana.
 
You are on the right track. FIR very simple architecture :

View attachment 147961

So you have a delay line (memory) where each sample is input (after) shifting
all the prior samples (N samples) and recompute the sum of all the taps. The
b's above are the coefficient values)

Regards, Dana.

So basically, that would be the design, right?
Or am I missing any other "component"?

Thank you so much for the help and for responding! :)
 
Of course you need A/D front end, and a LPF filter to masnage Nyquist related issues.

You can do the filter in code and or HW.

Your adder wants to be an ALU of some sort.

Take a look at old time 2900 family ALU, it was quite the rage (speed) before
MOS bloomed in capability.



1733590324837.png


Or use a single chip that has all the elements and routable (simple with wizard).
It also does have an ALU HW block in it, and happens to have a real capable
digital filter, FIR and IIR, but you can ignore that and using basic logic elements
and/or Verilog do a design. Has the A/D, and analog stuff. Following is example
of whats on chip, multiple copies in many cases of these elements. PSOC 5LP

The datapaths in parts (in each UDB) , they are cascadeable (there are 24 UDB's in the bigger parts) :

1733590498442.png


1733587997524.png


As stated earlier you can use Verilog to create your own elements, or schematic capture.
Community has done quite a few (apart from the above standard lib). Like DDS, CORDIC,
CPLD, 74HC equivalents (a few)....

The board is ~$15, CY8CKIT-059, IDE and compiler free, PSOC Creator.

Regards, Dana.
 
Last edited:
Note there is a community designed block, 24 bit, that implements a MAC, that can be
implemented inside the PSOC 5LP -

1733594389624.png


Note in the PSOC 4 family, there is a 32 bit single cycle multiplier in it. This
family is a subset of the 5LP mentioned prior.


Regards, Dana.
 
Thank you so much, danadak . All of this was very helpful. If I encounter any issues, I’ll come back again. I really appreciate your time and kindness.
Note there is a community designed block, 24 bit, that implements a MAC, that can be
implemented inside the PSOC 5LP -

View attachment 147970

Note in the PSOC 4 family, there is a 32 bit single cycle multiplier in it. This
family is a subset of the 5LP mentioned prior.


Regards, Dana.
 
After several days, I managed to make some progress. Obviously, it's on shaky ground, as the only thing that has helped me are the ideas I've seen on this forum. To be more specific, my task is to accomplish what the statement says at the beginning and, in addition, create a state machine, an implementation using logic gates (this being the biggest problem, as my professor only told me that I have to imagine that I must assume the existence of a multiplier and an adder component). I don’t understand how the connection or link I need to establish with the gates should work. I fail to see the relationship between the FIR Filter and what he mentioned and/or the logic gates.

On the other hand, I also have to create a set of instructions in the style of a processor.

I’m sharing everything I’ve done so far. If anyone happens to see and read this, I would deeply appreciate your time and help, as I feel completely alone in this journey.

State Machine:
Microprogrammable System:
1733922397355.png


Control Unit:
1733922416672.png


Mealy Machine:
1733922432765.png



Processing Unit:
1733922305946.png



FIR Filter with Constraints:

1733922480613.png
 
The basic process is acquire a set of samples to fill the filter # of taps.
Then each time a new sample comes in its added to the memory and
the oldest sample discarded. Additionally each time, after this memory
task is done, the sum of all tap samples times their respective tap weights
(filter coefficients) is computed and that becomes the output at that point
in time for the filter.

One can do this entirely in HW, with a memory controller to effect the "psuedo
shift" in time each new sample produces. Or in software with a array, a pointer
into array, that writes the new sample (overwriting the oldest sample) and handling
wrap around in code when last element in array has been written. DMA for more
sophisticated design can manage sample storage/position in array, or use code.

So for an N tap filter a cycle is take in sample, perform N multiplies of coefficients
and samples, all doing multiply and accumulate operations, then write result
out.

Regards, Dana.
 
Last edited:
Key is a circular buffer for incoming samples :


So overall operation is :

1734003701926.png


So every time a sample is taken in N multiplies of the coefficient memory
and respective sample occur, and a running sum of each of these multiplies
is done to produce the output.

So we have a fixed memory of N coefficients, a circular buffer of N samples,
and the process is run, sample in>>N multiplies summing each multiply of
sample and coefficient > rinse and repeat. Notice then that each sample, over
N clocks, is subjected to each coefficient in the MAC operation process. Then
that oldest sample "thrown" away, and new one taken in and whole N taps
recomputed to get output.
 
Last edited:
After reviewing and rethinking what you sent me, danadak , I’ve made some modifications.

I’m sharing screenshots of how I’ve designed the microprogrammable system, the Mealy automaton, the processing unit, and I’ve also made changes to the instructions.
Thinking about the fact that samples will keep coming, as you mentioned, and it shouldn’t stop since it will be an "infinite" cycle, the microprogrammable system iterates by returning to state N0.

1734094626570.png


Here lies the major concern and question: are these instructions valid and correct?

mov $Acum, 0 # Initialize accumulator to 0
mov $K, 0 # Initialize index K to 0
# Coefficients h[0], h[1], h[2], h[3] are preloaded into fixed registers: $R0, $R1, $R2, $R3
# Samples x[0], x[1], x[2], x[3] are in memory and will be dynamically loaded into Rs with LDA
loop_start:
xor $K, 3 # Compare index K with 3 using XOR (if $K == 3, result = 0)
je end_cycle # If $K == 3, jump to the end of the cycle

# Load sample x[K] into Rs:
lda $x_$K # Load sample x[K] into Rs from memory

# Select coefficient h[K] from predefined registers:
mov $Rh, $R[$K] # Load predefined coefficient h[K] into Rh (using registers R0, R1, R2, R3)

# Multiplication and accumulation:
mul Rs, Rh # Multiply Rs (x[K]) * Rh (h[K]), result in reg_a
add $Acum, reg_a # Add partial product to accumulator Acum

# Increment and repeat the cycle:
inc $K # Increment index K
jump loop_start # Repeat cycle for the next sample
end_cycle:
out $Acum # Output the accumulator as Y
mov $Acum, 0 # Reinitialize accumulator for the next execution
mov $K, 0 # Reinitialize index K
jump loop_start # Return to the start to continue processing



---------------------------------------------------------------------------------------------------------------------------------

Is it correct to calculate latency in the microprogrammable system this way? I have doubts whether I should consider 1 clock cycle per state, depending on whether K or J iterates, or if, in these specific cases, it should be considered differently. In other words, is it correct to believe that there are 22 clock cycles for the entire circuit? Obviously, speaking of 1 sample, since if there are N samples, the clock cycles would be infinite.

Relationship between J and K:The index K iterates 4 times (from 0 to 3) in each cycle.The index JJJ controls the global cycle and also iterates 4 times (from 0 to 3).Therefore, in the system flow, JJJ and KKK iterate the same number of times in total.

Adjusted clock cycle calculation:
State δ0:
The initial system is set up (1 clock cycle).

State δ1:
FIR processing for each K value: Each K iteration involves 4 cycles (compare, multiply, add, increment). 4 iterations of K: 4×4=16.

State δ2:
Increment K++, reset J=0, and move to the next state (1 cycle).

State δ3:
Increment J++ and check J≤3: Each J iteration involves 1 additional cycle. Since J also iterates 4 times:
4 iterations of J: 4×1=4 cycles.


Total latency per global cycle (1 FIR sample):
Tlatency=Tδ0+Tδ1+Tδ2+Tδ3

Substituting values:
Tlatency=1(δ0)+16(δ1)+1(δ2)+4(δ3)=22 clock cycles.

What happens when adding more samples? The process would be cyclical, but the initial time (δ0) only occurs once at the beginning of the system. Therefore, subsequent samples take:

Tadditional_samples=16(δ1)+1(δ2)+4(δ3)=21 cycles per additional sample.

Conclusion:First FIR sample: 22 cycles.

Additional samples: 21 cycles each.
 
A simple check on your design is to do a hand unit impulse response, which
essentially produces a train of output values = to your coefficients.

So for a 4 stage filter, say coefficients = 1, 2, 3, 4

tZ-1Z-2Z-3Z-4Yout
000000
110001
201002
300103
400014
Coefficents 1, 2, 3, 4




Regards, Dana.
 
After verifying what you sent me and comparing it with my proposals, I made some modifications to make it work.
Now, my specific question is: I was told that I need to assume I have a "box" that represents the multiplier and another one that represents the Full Adder, and with those, I need to implement it using logic gates.
I don't understand what exactly I need to implement with the logic gates. I really need help with this because I've been stuck on it for several days. danadak
This is the only implementation I could come up with, but it's incorrect. I really don't know how to proceed.

1734356147456.png
 
Google "full adder schematic", look at image tab in browser. Here is a old TTL legacy
version :


You can cascade them.

Multipliers can be single cycle to serial versions, many variants.
Do the same here, google "ttl multiplier schematic"


Regards, Dana.
 
Google "full adder schematic", look at image tab in browser. Here is a old TTL legacy
version :


You can cascade them.

Multipliers can be single cycle to serial versions, many variants.
Do the same here, google "ttl multiplier schematic"


Regards, Dana.
Yes, of course, but he told me that I need to assume the multiplier and adder already exist. Therefore, I have to consider the Full Adder as a black box and the multiplier as a black box. I understand that I need to implement the rest, and that's where the problem comes in: what do I do with the rest? I have no idea where to apply the logic gates.
 
Note your drawing not bad. Its the essence of what you want to do.

You need logic, latches, to present the data to be multiplied and then summed.
A state machine basically. Eg. you mux, muxing latched / memory values.

So latch operands, then invoke multiplier and adder to do its thing, basically
present data, eg. clk it into a latch, then take data from output, another latch
and present it to adder to create sum. All the while keeping track of address
into sample memory to get data, from the taps delay element. Address also
identifies last element to be muled and sumed, that then triggers outputing
the sum (in memory or a latch) to the filter output. Basically a sequential
process. Usually in todays world thats implemented, all that control logic,
and intermediate values, using LUTs (FPGA/ASIC work).

Think like a calculator, write down line by line each step it takes to multiply
and sum a series of numbers. Then mentally you see what each step needs
to move onto the next step. So you have a timeline of all the steps that have
to occur.

Enter operand A
Present control signal you want to multiply
Enter operand B
Signal to complete operation (multiply, "=")
Signal to add to memory (the sum, retrieve memory, add, store back in memory)
Test if all samples done
If true recover mul-sumed memory value (filter result, output), otherwise loop again.
eg. a counter value counting the tap # you are working on or have just completed.
Use a comparator to trigger done with MAC phase.......comparing current address
into memory to get tap value with N, the #taps in filter.

In HW you can do this sequentially tap at a time, or do the whole thing in one parallel
set of operations with N multipliers, or solutions in between.


Regards, Dana.
 
Last edited:

Latest threads

Back
Top