by student André Luiz Nazareth da Costa (andre.lnc [at] gmail.com)
and mentor Timothy B. Terriberry (tterribe [at] vt.edu)
Decoding video in the Theora format requires a great deal of processing power.
Developing dedicated hardware for it is therefore a viable solution,
and some modules were already implemented successfully in hardware during GSoC 2006
(Google Summer of Code). The idea is to take an FPGA with a small embedded processor
and put just the critical modules in hardware.
The goal of my project is to continue last year's project, putting
one or more modules in hardware and thereby reducing CPU processing time.
This implementation will be done in VHDL and synthesized for the Altera Stratix II FPGA.
GSoC Project page: http://code.google.com/soc/2007/xiph/appinfo.html?csaid=4235040C184DBD68
The Xiph.Org Foundation (http://www.xiph.org/) is a non-profit corporation dedicated to protecting the foundations of Internet multimedia from control by private interests. The purpose is to support and develop free, open protocols and software to serve the public, developer, and business markets. Theora is the video codec from Xiph, based on the VP3 codec donated by On2 Technologies.
Google Summer of Code (http://code.google.com/soc/) is a program that offers student developers stipends to write code for various open source projects. Google works with several open source, free software and technology-related groups to identify and fund several projects over a three-month period. Historically, the program has brought together over 1,000 students with over 100 open source projects to create hundreds of thousands of lines of code. The program, which kicked off in 2005, is now (2007) in its third year.
The first step (analysis of the Theora decoding process) was studied first by Felipe Portavales and later by Leonardo Piga. The conclusion was that the function reconrefframes consumes approximately 60% of CPU time, while the functions before it have many decision structures and few processing structures (such as multiplication). You can see this at http://svn.xiph.org/trunk/theora-fpga/doc/. Thus, this first part is not very interesting to implement in hardware, but reconrefframes is.
Felipe Portavales did the iDCT and Leonardo Piga did the other functions. The VHDL simulation works, and so does the synthesis for the FPGA.
However, the integration was done with the NIOS processor, which is a proprietary processor.
The non-proprietary alternative was the LEON.
NIOS has a good interface and good FPGA support. LEON has a different interface
and is very flexible. I therefore started to study this processor; my
first goal for GSoC was to do the full integration of the Theora hardware with
LEON. This page describes how to do this integration step by step.
Theora: http://www.theora.org/
Xiph: http://www.xiph.org/
Theora Hardware Wiki: http://wiki.xiph.org/index.php/TheoraHardware
Google Summer of Code page: http://code.google.com/soc/
GSoC Project page: http://code.google.com/soc/2007/xiph/appinfo.html?csaid=4235040C184DBD68
Gaisler : http://www.gaisler.com
Vorbis Hardware implementation on LEON2: http://oggonachip.sourceforge.net/
MP3 Hardware implementation on LEON2: http://lampiao.lsc.ic.unicamp.br/~billo/leon2_on_mblazeboard/index.htm
Leon Sparc: http://tech.groups.yahoo.com/group/leon_sparc/
Theora: http://lists.xiph.org/mailman/listinfo/theora-dev
figure 1
Gaisler Research provides a complete framework for the development of processor-based
SOC designs. The framework is centered around the LEON processor core and includes
a large IP library, behavioral simulators, and related software development
tools.
http://www.gaisler.com
The GRLIB IP Library is an integrated set of reusable IP cores, designed for
system-on-chip (SOC) development. The IP cores are centered around the common
on-chip bus, and use a coherent method for simulation and synthesis.
http://gaisler.com/products/grlib/grlib.pdf
You first need to install GRLIB (I worked with grlib-gpl-1.0.15-b2149.tar.gz),
following the instructions in grlib.pdf.
Once GRLIB is installed, you can run "make xconfig" in "grlib/designs/leon3-altera-ep2s60-sdr" (I
used the Stratix II EP2S60F672C5ES).
There, you can select these components:
Component                     Vendor
LEON3 SPARC V8 Processor      Gaisler Research
AHB Debug UART                Gaisler Research
AHB Debug JTAG TAP            Gaisler Research
LEON2 Memory Controller       European Space Agency
AHB/APB Bridge                Gaisler Research
LEON3 Debug Support Unit      Gaisler Research
Generic APB UART              Gaisler Research
My configuration file: config.in
Now, you can run the synthesis of your design (make quartus).
FPGA pin problems: take care to select the design that matches
your FPGA, otherwise you can have problems with pin mapping.
GRMON is a general debug monitor for the LEON processor, and for SOC designs
based on the GRLIB IP library.
We will use it to load and execute LEON applications.
Manual: http://www.gaisler.com/doc/grmon.pdf
ftp://gaisler.com/gaisler.com/grmon/grmon-eval-1.1.21.tar.gz
Run GRMON with this command:
grmon-eval -altjtag -u
-altjtag : Connect to the JTAG Debug Link using Altera USB Blaster or Byte
Blaster.
-u : Put UART 1 in loop-back mode, and print its output on monitor console.
figure 2
Create a new directory (such as /theora_hardware/).
Download libtheora-1.0alpha6.tar.gz and unpack it in /theora_hardware/
http://downloads.xiph.org/releases/theora/libtheora-1.0alpha6.tar.gz
tar -xzf libtheora-1.0alpha6.tar.gz
Download libogg-1.1.3.tar.gz and unpack it in /theora_hardware/libtheora-1.0alpha6/
http://downloads.xiph.org/releases/ogg/libogg-1.1.3.tar.gz
tar -xzf libogg-1.1.3.tar.gz
Now you will need BCC (Bare-C Cross-Compiler). BCC is a cross-compiler
for LEON2 and LEON3 processors.
Download sparc-elf-3.4.4-1.0.29.tar.bz2 and unpack it in /opt/
mkdir /opt
tar -C /opt -xjf sparc-elf-3.4.4-1.0.29.tar.bz2
Since we are not running on Linux, you need to take care with file functions. You can comment out the fprintf calls, replace the fread calls with reads from a vector of inputs, and turn each fwrite into a plain printf. Like this:
dump_video_hardware.c
insert vector_of_input.h
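The fread replacement described above can be sketched as follows. This is only a minimal illustration, assuming the input video has been converted into a byte array; the names mem_stream and mem_read are hypothetical and are not taken from dump_video_hardware.c or vector_of_input.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stream standing in for a FILE* on a bare-metal target. */
typedef struct {
    const unsigned char *data; /* start of the embedded input vector */
    size_t size;               /* total number of bytes available */
    size_t pos;                /* current read position */
} mem_stream;

/* Same contract as fread(): copies up to nmemb items of `size` bytes each
   and returns the number of complete items actually copied. */
size_t mem_read(void *dst, size_t size, size_t nmemb, mem_stream *s)
{
    size_t avail, items;

    if (size == 0 || nmemb == 0)
        return 0;
    avail = s->size - s->pos;       /* bytes left in the vector */
    items = avail / size;           /* complete items left */
    if (items > nmemb)
        items = nmemb;
    memcpy(dst, s->data + s->pos, items * size);
    s->pos += items * size;
    return items;
}
```

With a helper like this, a call such as fread(buffer, 1, 4096, infile) could become mem_read(buffer, 1, 4096, &stream), where stream points at the embedded array.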
There was an error caused by a bug in the Ogg library:
IU in error mode (tt = 0x07)
400013a4 e8220011 st %l4, [%o0 + %l1]
Trap type 0x07 is a memory access to an unaligned address. Some architectures support unaligned stores, but SPARC does not (a 4-byte access must be aligned to 4 bytes). I was lucky to find a report from a group that put the Vorbis decoder on an FPGA. It was the master's thesis of 2 students: http://oggonachip.sourceforge.net/.
Then you just need to add some extra lines to the configure.in file (in the Ogg library directory, /theora_hardware/libtheora-1.0alpha6/libogg-1.1.3/) as follows:
AC_CHECK_SIZEOF(short,2)
AC_CHECK_SIZEOF(int,4)
AC_CHECK_SIZEOF(long,4)
AC_CHECK_SIZEOF(long long,8)
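To illustrate the trap itself: on SPARC, a 32-bit store through a pointer that is not 4-byte aligned raises trap 0x07. The sketch below shows a portable way to write a word at an arbitrary byte offset using memcpy. It is not the fix applied here (that is the configure.in change above), and the helper names store_u32 and load_u32 are hypothetical, not part of libogg.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value at an arbitrary byte offset without requiring the
   destination address to be 4-byte aligned. */
void store_u32(unsigned char *buf, size_t offset, uint32_t value)
{
    /* A direct store such as *(uint32_t *)(buf + offset) = value; would trap
       on SPARC whenever buf + offset is not a multiple of 4. */
    memcpy(buf + offset, &value, sizeof value);
}

/* Load a 32-bit value from an arbitrary byte offset. */
uint32_t load_u32(const unsigned char *buf, size_t offset)
{
    uint32_t value;

    memcpy(&value, buf + offset, sizeof value);
    return value;
}
```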
You can run this script
# Export sparc-elf PATH
export PATH=/opt/sparc-elf-3.4.4/bin:$PATH
# Clean all
make distclean
cd libogg-1.1.3/
make distclean
# Set CROSS-Compiler and parameters
export CC=sparc-elf-gcc
export CXX=sparc-elf-gcc
export CFLAGS='-mv8 -msoft-float -static'
# -mv8 generates SPARC V8 mul/div instructions - needs hardware multiply and divide
# -msoft-float emulates floating-point - must be used if no FPU exists in the system
# Configure and install the Ogg library
./configure --prefix=/theora_hardware/ --target=sparc-elf \
    --host=sparc-elf --enable-static
make
make install
#Configure and make Theora for LEON (sparc)
cd ../
./configure --prefix=/theora_hardware/ --target=sparc-elf \
    --host=sparc-elf --enable-static --disable-encode
make
After the last step you will have the binary "dump_video_hardware". In the first step (the LEON processor) the synthesis generated a programming file (leon3mp.sof) that you can now use to program your FPGA. Then open the GRMON interface and load dump_video_hardware ("load dump_video_hardware"). Now, "run dump_video_hardware".
figure 3
LINUX support for LEON2 and LEON3 is provided through a special version of the SnapGear Embedded Linux distribution. SnapGear Linux is a full source package, containing kernel, libraries and application code for rapid development of embedded Linux systems.
Download the Snapgear:
ftp://gaisler.com/gaisler.com/linux/snapgear/snapgear-p33a.tar.bz2
Snapgear Manual:
ftp://gaisler.com/gaisler.com/linux/snapgear/snapgear-manual-1.33.0.pdf
Download the Sparc Linux Cross Compiler:
ftp://gaisler.com/gaisler.com/linux/snapgear/sparc-linux-1.0.0.tar.bz2
Kernel version that I am using: linux-2.6.21.1 for MMU systems
The tool-chain should be installed under /opt :
cd /opt
tar xjf /sparc-linux-1.0.0.tar.bz2
Add /opt/sparc-linux/bin to your PATH.
The SnapGear distribution can be installed anywhere:
tar -xjf snapgear-p33a.tar.bz2
General instructions on how to use SnapGear Linux are provided with the distribution.
After programming your FPGA with LEON3, you can open GRMON with this command:
./grmon-eval -altjtag -nb -abaud 38400 -nosram
GRMON should be started with -nb to avoid going into break mode on a page fault or data exception.
Problem with SRAM
I disabled the SRAM (-nosram) because I had just 2 Mbit of SRAM on my FPGA,
so I needed to load the kernel into SDRAM. However, I was having memory
mapping problems, so I decided to disable the SRAM.
Serial and JTAG debug link
The "-abaud 38400" option sets the application baud rate for UART 1.
To get a console from Linux you need to connect a serial
cable to your computer. Then you can use a program like "kermit", which
provides serial communication with the Linux console on the FPGA. Some FPGA boards
have 2 serial connectors; BE SURE that you are using the right one!
I am using the following kermit configuration:
set line /dev/ttyS0
define sz !sz \%0 > /dev/ttyS0 < /dev/ttyS0
set speed 38400
set carrier-watch off
set prefixing all
set parity none
set stop-bits 1
set modem none
set file type bin
set file name lit
set flow-control none
set prompt "Sparc Linux Kermit> "
c
Now load your kernel image (image.dsu) generated with SnapGear and you will see
your console running in kermit.
figure 4
Now you can use the original dump_video.c because you are running Linux, so you can work with files.
# Export sparc-linux PATH
export PATH=/opt/sparc-linux/bin/:$PATH
# Clean all
make distclean
cd libogg-1.1.3/
make distclean
# Set CROSS-Compiler and parameters
export CC=sparc-linux-gcc
export CXX=sparc-linux-gcc
export CFLAGS='-msoft-float -fPIC -static'
# -msoft-float emulates floating-point - must be used if no FPU exists in the system
# -g generates debugging information - must be used for debugging with gdb
# -fPIC generates position-independent machine code; it is necessary because we are using Linux now
# -static when linking an application statically, all code used from libraries is included in the output binary
# Configure and install the Ogg library
./configure --prefix=/homes_export/andre.lnc/theora/libtheora6_hard/ --target=sparc-linux \
    --host=sparc-linux --enable-static
make
make install
#Configure and make Theora for LEON (sparc)
cd ../
./configure --prefix=/homes_export/andre.lnc/theora/libtheora6_hard/ --target=sparc-linux \
    --host=sparc-linux --enable-static
make
After generating the binary for Linux on LEON3, you need to copy it to /snapgear-p33/romfs/home/ and build a Linux kernel image that includes the compiled Theora decoder (dump_video). Don't forget to copy a video to /snapgear-p33/romfs/home/ as well. Take care with the size of your Linux image; the SDRAM of your FPGA board needs to have space for it.
Then:
Program your board with LEON3;
Load the Linux image on LEON3 (using grmon);
Open your kermit interface and set the configuration;
Run the Linux kernel (using grmon);
Go back to kermit and you will see a Linux console;
Now go to home (cd home) and run dump_video (./dump_video video.ogg);
figure 5
AHB is a new generation of AMBA bus which is intended to address the requirements
of high-performance synthesizable designs. It is a high-performance system bus that
supports multiple bus masters and provides high-bandwidth operation.
AMBA AHB implements the features required for high-performance, high clock
frequency systems.
The APB is part of the AMBA hierarchy of buses and is optimized for minimal power
consumption and reduced interface complexity.
The AMBA APB appears as a local secondary bus that is encapsulated as a single AHB
slave device. APB provides a low-power extension to the system bus which
builds on AHB signals directly.
The APB bridge appears as a slave module which handles the bus handshake and
control signal retiming on behalf of the local peripheral bus.
You can see details:
http://www.gaisler.com/doc/amba.pdf
I searched theses and articles to decide where the best place for the Theora
hardware would be, how to handle the communication between software and hardware
over the bus, and how to pass the data to the hardware. I found many different solutions.
AHB is a high-speed bus suitable for connecting units with high data rates.
The problem is that the Theora hardware would be a master on the AHB bus and
could overload the bus and reduce the performance of LEON3. APB is slower
than AHB, but its protocol is simpler and it doesn't disturb the communication
between LEON3 and the memory controller. I also found hybrid solutions using APB
and AHB, but I thought it better to plug this into the APB bus only.
figure 6
How to include the Theora APB core
Create the directory grlib/lib/opencores/theora_hardware
Add "theora_hardware" to grlib/lib/opencores/dir.txt
Download revision 13432 from SVN into grlib/lib/opencores/theora_hardware/:
http://svn.xiph.org/trunk/theora-fpga/
You will need to rename the entity syncram to tsyncram in the modules syncram, expand block, loopfilter, copyrecon and databuffer, because syncram is a name already used by a different LEON3 component.
Now we need to create theora_hardware.vhd and theora_amba_interface.vhd:
theora_hardware.vhd and theora_amba_interface.vhd
Create grlib/lib/opencores/theora_hardware/vhdlsyn.txt and list all the VHDL files in it.
If you prefer, you can download these files here: theora_hardware1.tar
You should register the Theora hardware APB/AMBA device (OPENCORES_THEORA_HARDWARE
under VENDOR_OPENCORES) just by changing the file devices.vhd (grlib/lib/grlib/amba/):
devices.vhd
Finally, we need to instantiate theora_hardware in leon3mp.vhd, taking care
to use a free index of the APB slave output vector (apbo(i)):
leon3mp.vhd
Before synthesis ("make quartus"), run the commands "make distclean" and "make
script" in your design directory (grlib/designs/leon3-altera-ep2s60-sdr/).
struct theora_regs_t {
    volatile int flag_send_data;
    volatile int data_transmitted;
    volatile int flag_read_data;
    volatile int data_received;
};
struct theora_regs_t * theora_regs = (struct theora_regs_t *)0x80000800;
flag_send_data (address 0x80000800): Flag used by the driver to know whether it can send data to the Theora hardware.
data_transmitted (address 0x80000804): Data transmitted to the Theora hardware.
flag_read_data (address 0x80000808): Flag used by the driver to know whether it can receive data from the Theora hardware.
data_received (address 0x8000080C): Data received from the Theora hardware.
sparc-elf-gcc -mv8 -msoft-float -g send_vector_of_input.c -o send_vector_of_input.exe
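The handshake implied by the four registers above can be sketched as two non-blocking helpers. This is only an illustration under stated assumptions: a nonzero flag is taken to mean "ready" (the page does not give the polarity), and the names theora_try_send and theora_try_receive are hypothetical, not taken from send_vector_of_input.c. Passing the register block as a parameter lets the same code run against 0x80000800 on the board or against an ordinary struct on a host.

```c
#include <stddef.h>

/* Register layout as described above (offsets 0x0, 0x4, 0x8, 0xC). */
struct theora_regs_t {
    volatile int flag_send_data;
    volatile int data_transmitted;
    volatile int flag_read_data;
    volatile int data_received;
};

/* Try to send one word. Returns 1 if the word was written, 0 if the
   hardware was not ready. Assumption: nonzero flag_send_data means ready. */
int theora_try_send(struct theora_regs_t *regs, int data)
{
    if (!regs->flag_send_data)
        return 0;
    regs->data_transmitted = data;
    return 1;
}

/* Try to receive one word. Returns 1 and stores the word in *data if one
   was available, 0 otherwise. Assumption: nonzero flag_read_data means
   a word is available. */
int theora_try_receive(struct theora_regs_t *regs, int *data)
{
    if (!regs->flag_read_data)
        return 0;
    *data = regs->data_received;
    return 1;
}
```

On the LEON3 target the first argument would be (struct theora_regs_t *)0x80000800, and a bare-metal test program could simply loop on theora_try_send until it returns 1.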
The theora_amba_interface implements the APB/AMBA peripheral that receives and transmits data between the driver and theora_hardware, using the addressing protocol defined above and the reconrefframes protocol.
figure 7
A driver is necessary because we are running Linux: software running on Linux cannot write to a physical address directly, so it needs a driver.
There are many tutorials on the internet on how to write a character device, so I will not go into those details.
The parameters of the transaction between software and driver that I defined are these:
struct _data
{
    int read;
    int wrote;
    int data;
};
struct _data dt;
I/O control function: theora_ioctl(struct inode *inode, struct file *filp, unsigned int nFunc, unsigned long nParam)
If nFunc = '0', the driver will try to read from the Theora hardware. If nFunc = '1', the driver will try to write to the Theora hardware.
On a successful read, dt.read returns 1. Otherwise it returns 0.
On a successful write, dt.wrote returns 1, with the data in dt.data. Otherwise dt.wrote returns 0.
See the Theora driver: theora.c
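From user space, the transaction described above could be wrapped as in the sketch below. Several things here are assumptions rather than facts from theora.c: that nParam carries a pointer to struct _data, that a read returns its word in dt.data, and the helper names theora_write_word and theora_read_word, which are hypothetical.

```c
#include <sys/ioctl.h>

/* Transaction structure as described above. */
struct _data {
    int read;
    int wrote;
    int data;
};

#define THEORA_FUNC_READ  0 /* nFunc = 0: read from the Theora hardware */
#define THEORA_FUNC_WRITE 1 /* nFunc = 1: write to the Theora hardware */

/* Returns 1 if the hardware accepted the word, 0 if it was not ready,
   -1 if the ioctl call itself failed.
   Assumption: nParam is a pointer to struct _data. */
int theora_write_word(int fd, int value)
{
    struct _data dt = { 0, 0, value };

    if (ioctl(fd, THEORA_FUNC_WRITE, &dt) < 0)
        return -1;
    return dt.wrote ? 1 : 0;
}

/* Returns 1 and stores the word in *value if one was read, 0 if nothing
   was available, -1 if the ioctl call itself failed.
   Assumption: the driver returns the word in dt.data. */
int theora_read_word(int fd, int *value)
{
    struct _data dt = { 0, 0, 0 };

    if (ioctl(fd, THEORA_FUNC_READ, &dt) < 0)
        return -1;
    if (!dt.read)
        return 0;
    *value = dt.data;
    return 1;
}
```

Because both helpers return immediately, any retry loop stays in the application rather than in the driver.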
Copy theora.c (the driver) to snapgear-p33/linux-2.6.21.1/drivers/char/
Add the line "obj-$(CONFIG_THEORA) += theora.o" to snapgear-p33/linux-2.6.21.1/drivers/char/Makefile. Like this Makefile
Add the lines ...
config THEORA
bool "Theora Driver"
default y
... to snapgear-p33/linux-2.6.21.1/drivers/char/Kconfig. Like this Kconfig
You need to make sure to select a unique (unused) number from snapgear-p33/linux-2.6.21.1/Documentation/devices.txt. In my case it was number 121.
Then you need to add the line "DEVICES = theora,c,121,0 \" to snapgear-p33/vendors/gaisler/leon3mmu/Makefile. Like this Makefile. It will create /dev/theora each time make is run.
Now, if you want to generate the Linux image, you just need to run "make" in the snapgear-p33 directory.
When you boot the linux from FPGA you will see these lines:
Loading theora ...
LEON THEORA driver by Andre Costa (2007) - andre.lnc@gmail.com
- Unable to handle kernel paging request at virtual address 80000000:
The MMU protects certain memory spaces, you either bypass the MMU using
the SPARC specific STA or LDA instructions (not recommended) or use
ioremap to inform the MMU about the new area. In my case I used the ioremap.
- Warning: ioremap: done with statics, switching to malloc
Error (running on FPGA): alloc_io_res(phys_80000800): cannot occupy
Halt
Halt
The problem is that ioremap() is being called repeatedly. You
should call it once, keep the pointer it returns, and use that pointer
to access the hardware in the rest of the code. I was calling ioremap in ioctl(), but it should be in theora_init().
- BUG: soft lockup detected on CPU#0!
A soft lockup is when the kernel fails to reschedule for 10 seconds. This
implies that your driver does not yield the CPU. For example, in your
read/write functions you should either return immediately or sleep until
woken up by an interrupt. You may not busy-wait. I was looping in the driver (until data was received from theora_hardware), but the loop should be only in the modified libtheora software.
You will need to edit dct_decode.c from libtheora. First, open the driver: pf = open("/dev/theora", O_RDWR);
The function write_theoradriver(int pf, int data) that was implemented is responsible for sending data to the driver. We then need to send and receive all the data in the correct sequence. Take care here: if even a single word is not sent or read, it can stall the whole decoding pipeline. You can receive the data back in order to compare the output.
dct_decode2.c
codec_internal.h (some little changes on this file)
figure 8
The controller consists of a YUV to RGB converter and a video signal generator that sends the signal to a D/A converter.
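As an illustration of what a YUV to RGB converter computes, here is a minimal sketch in C using the widely used fixed-point approximation for video-range ITU-R BT.601 Y'CbCr. The coefficients actually used in the VHDL video controller are not given on this page, so this is an assumption, and the function names are hypothetical.

```c
#include <stdint.h>

/* Scale a fixed-point value (8 fractional bits) back to 0..255 with clamping.
   Negative values are clamped before shifting to keep the shift well defined. */
static uint8_t clamp_shift8(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Convert one video-range Y'CbCr pixel to 8-bit RGB (rgb[0]=R, rgb[1]=G, rgb[2]=B). */
void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    int c = (int)y - 16;   /* luma with black level removed */
    int d = (int)u - 128;  /* Cb centered on zero */
    int e = (int)v - 128;  /* Cr centered on zero */

    rgb[0] = clamp_shift8(298 * c + 409 * e + 128);
    rgb[1] = clamp_shift8(298 * c - 100 * d - 208 * e + 128);
    rgb[2] = clamp_shift8(298 * c + 516 * d + 128);
}
```

With these coefficients, an all-zero pixel (Y=0, U=0, V=0) converts to a pure green shade, since only the green channel stays positive.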
Lancelot is a video D/A converter; it is necessary because the Stratix II doesn't have one.
You should read the manual.
figure 9
Leonardo Piga built a video controller and plugged it into NIOS. I then worked on plugging this video controller into my LEON-Theora integration, and I found some problems that I describe here.
dct_decode: The difference between this dct_decode.c and dct_decode2.c is that we no longer need to receive the outputs of reconrefframe and compare them with the software; we just need to send the pre-decoded data to reconrefframe.
Besides this, we need to send the height and the width, because the video controller will request them.
You can see my dct_decode: dct_decode.c and dump_video.c (now nothing is written to a file; the data is transmitted to theora_amba_interface)
Hierarchy of the modules: Now we have the theora_hardware, which contains the reconrefframe and the video controller. Some adaptations were necessary (theora_apb.vhd, theora_amba_interface.vhd, theora_hardware.vhd ...). Here you can download all these modules: theora_apb.tar
Pins of Lancelot: You will need to connect all the Lancelot pins to the LEON system. My new leon3mp is leon3mp.vhd, and my pin assignment file is leon3mp.qsf
I had some difficulty plugging it into LEON because of hardware constraints. The clock frequency used by the video controller is 25 MHz, but the LEON system runs at 50 MHz. Simply adding a clock divider was not enough, because during synthesis I had cross-clock-domain problems in timing analysis: the video controller (25 MHz) needs to receive data from a 50 MHz module, which was causing clock skew problems. The solution was simple: I changed some parameters of the PLL of the LEON system. The PLL (phase-locked loop) is basically a closed-loop frequency control system that generates the LEON and SDRAM clocks with adjusted phase; I added a new clock there with the correct parameters. Like this in /grlib/lib/techmap/clocks/clkgen_altera_mf.vhd
clkgen_altera_mf.vhd
dump_video adds a green band of 8 pixels below the video. If I run a 96x72 video, I get a 96x80 video. Something like:
Ogg logical stream 583c6ca0 is Theora 96x80 29.97 fps video
Encoded frame content is 96x72 with 0x0 offset
Theora encodes the frame in whole 16x16 macro blocks, so both the width
and height must be a multiple of 16. When the actual video content is
not a multiple of 16, it is expanded to one and a clipping rectangle is
stored in the header (that's the "Encoded frame content..." message).
dump_video does not crop the output down to the actual size of this
rectangle, but outputs the entire expanded frame. The encoder by default
stores zeros in this part of the frame, so that's why it looks green.
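The arithmetic behind the padded frame size can be sketched in C as follows. The function names are hypothetical and are not part of libtheora; the code only restates the rule above that each dimension is rounded up to a multiple of 16.

```c
/* Round a picture dimension up to the next multiple of 16, the macro block
   size used for the encoded frame. */
int theora_padded_size(int n)
{
    return (n + 15) & ~15;
}

/* Number of padding rows or columns added to reach the encoded frame size. */
int theora_padding(int n)
{
    return theora_padded_size(n) - n;
}
```

For the 96x72 example, the width is already a multiple of 16 and stays at 96, while the height is rounded up from 72 to 80, which gives the 8 extra rows seen in the output.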
Here you can find the ffmpeg2theora software, with which you can change the resolution, the start and end points, and several other useful things that you will certainly need in order to test some videos.
I did a demonstration of this integration up to the video controller, and it is on YouTube.
Click here to see the video
My current FPGA programmer file: leon3mp.sof
My current Linux kernel image (with the Theora driver and dump_video included and compiled): image.dsu
There are basically two problems:
On NIOS, the video was running very slowly, almost 7 times slower than real time. On my LEON system it is still slow, but only 5 times, so the performance is a little better than with NIOS. Over the last few days I have been debugging the flow to discover what I can do to increase the speed.
The APB/AMBA bus time is OK. I took some measurements, and decoding with the old pipeline takes just 1/2 of the time available; this includes the time for the software to decode the first part, send it to the hardware, and read back the output. A 15-second video is decoded in 7 seconds. But if I plug in the video controller, it takes 75 seconds (5 times). I am trying to fix this problem.
There is another problem: the image is good, but there are some small purple dots in it. Leonardo told me that he is working on this problem.
The size is small because the buffer multiplexed with an external memory (SRAM) was not implemented, so we can only use the few internal memory blocks of the FPGA.
Despite these problems, I think the most important thing is that we now have complete Theora decoding on an FPGA with no NIOS or any proprietary module: you put an .ogg video on Linux and see the video on a monitor.
figure 10
NOT IMPLEMENTED
figure 11
[to complete]
[to complete]