by student André Luiz Nazareth da Costa (andre.lnc [at] gmail.com)
and mentor Timothy B. Terriberry (tterribe [at] vt.edu)
Decoding video in the Theora format requires a great deal of processing power. Developing dedicated hardware for it is therefore a viable solution, and some modules were already implemented successfully in hardware during GSoC 2006 (Google Summer of Code). The idea is to take an FPGA with a small embedded processor and put just the critical modules in hardware.
The goal of my project is to continue last year's project, putting one or more modules in hardware and thereby reducing CPU processing time. This implementation will be done in VHDL and synthesized for the Altera Stratix II FPGA. GSoC Project page: http://code.google.com/soc/2007/xiph/appinfo.html?csaid=4235040C184DBD68
The Xiph.Org Foundation (http://www.xiph.org/) is a non-profit corporation dedicated to protecting the foundations of Internet multimedia from control by private interests. The purpose is to support and develop free, open protocols and software to serve the public, developer, and business markets. Theora is the video codec from Xiph, based on the VP3 codec donated by On2 Technologies.
Google Summer of Code (http://code.google.com/soc/) is a program that offers student developers stipends to write code for various open source projects. Google works with several open source, free software and technology-related groups to identify and fund several projects over a three month period. Historically, the program has brought together over 1,000 students with over 100 open source projects, to create hundreds of thousands of lines of code. The program, which kicked off in 2005, is now (2007) in its third year.
The first step (analysis of the Theora decoding process) was studied first by Felipe Portavales and later by Leonardo Piga. The conclusion was that the function reconrefframes consumes approximately 60% of the CPU time, while the functions before it contain many decision structures and few processing structures (such as multiplications). You can see this at http://svn.xiph.org/trunk/theora-fpga/doc/. Thus, this first part is not very interesting to implement in hardware, but reconrefframes is.
Felipe Portavales did the iDCT and Leonardo Piga did the other functions. The VHDL simulation works and the FPGA synthesis works too.
However, the integration was done with the NIOS processor, which is proprietary. The non-proprietary alternative was the LEON.
NIOS has a good interface and good support for FPGAs. LEON has a different interface and is very flexible. So I started to study this processor in more detail; my first goal for GSoC was to do the complete integration of the Theora hardware with LEON. This page describes how to do this integration step by step.
Theora: http://www.theora.org/
Xiph: http://www.xiph.org/
Theora Hardware Wiki: http://wiki.xiph.org/index.php/TheoraHardware
Google Summer of Code page: http://code.google.com/soc/
GSoC Project page: http://code.google.com/soc/2007/xiph/appinfo.html?csaid=4235040C184DBD68
Gaisler : http://www.gaisler.com
Vorbis Hardware implementation on LEON2: http://oggonachip.sourceforge.net/
MP3 Hardware implementation on LEON2: http://lampiao.lsc.ic.unicamp.br/~billo/leon2_on_mblazeboard/index.htm
Leon Sparc: http://tech.groups.yahoo.com/group/leon_sparc/
Theora: http://lists.xiph.org/mailman/listinfo/theora-dev
figure 1
Gaisler Research provides a complete framework for the development of processor-based SOC designs. The framework is centered around the LEON processor core and includes a large IP library, behavioral simulators, and related software development tools.
http://www.gaisler.com
The GRLIB IP Library is an integrated set of reusable IP cores, designed for system-on-chip (SOC) development. The IP cores are centered around the common on-chip bus, and use a coherent method for simulation and synthesis.
http://gaisler.com/products/grlib/grlib.pdf
First you need to install GRLIB (I worked with grlib-gpl-1.0.15-b2149.tar.gz), following the instructions in grlib.pdf.
After you have GRLIB installed, you can run "make xconfig" in "grlib/designs/leon3-altera-ep2s60-sdr" (I used the Stratix II EP2S60F672C5ES).
There, you can select these components:
Component                     Vendor
LEON3 SPARC V8 Processor      Gaisler Research
AHB Debug UART                Gaisler Research
AHB Debug JTAG TAP            Gaisler Research
LEON2 Memory Controller       European Space Agency
AHB/APB Bridge                Gaisler Research
LEON3 Debug Support Unit      Gaisler Research
Generic APB UART              Gaisler Research
My configuration file: config.in
Now, you can run the synthesis of your design (make quartus).
FPGA pin problems: Take care to select the design that matches your FPGA, otherwise you can have problems with the pin mapping.
GRMON is a general debug monitor for the LEON processor, and for SOC designs based on the GRLIB IP library.
We will use it to load and execute LEON applications.
Manual: http://www.gaisler.com/doc/grmon.pdf
ftp://gaisler.com/gaisler.com/grmon/grmon-eval-1.1.21.tar.gz
Run GRMON with this command:
grmon-eval -altjtag -u
-altjtag : Connect to the JTAG Debug Link using Altera USB Blaster or Byte Blaster.
-u : Put UART 1 in loop-back mode, and print its output on monitor console.
figure 2
Create a new directory (such as /theora_hardware/).
Download libtheora-1.0alpha6.tar.gz and unpack it in /theora_hardware/
http://downloads.xiph.org/releases/theora/libtheora-1.0alpha6.tar.gz
tar -xzf libtheora-1.0alpha6.tar.gz
Download libogg-1.1.3.tar.gz and unpack it in /theora_hardware/libtheora-1.0alpha6/
http://downloads.xiph.org/releases/ogg/libogg-1.1.3.tar.gz
tar -xzf libogg-1.1.3.tar.gz
Now, you will need to use the BCC (Bare-C Cross-Compiler). BCC is a cross-compiler for LEON2 and LEON3 processors.
Download sparc-elf-3.4.4-1.0.29.tar.bz2 and unpack it in /opt/
mkdir /opt
tar -C /opt -xjf sparc-elf-3.4.4-1.0.29.tar.bz2
Since we are not running on Linux, you need to take care with the file functions. You can comment out the fprintf calls, change the fread calls to read from a vector of inputs, and turn each fwrite into a plain printf. Like this:
dump_video_hardware.c
insert vector_of_input.h
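One way to replace the fread calls is sketched below. It is only an illustration, not the contents of dump_video_hardware.c: the array contents, the name vector_read, and the position counter are assumptions made for this example. The idea is that the Ogg stream is compiled into the program as a byte array and a small helper hands it out in chunks, the way fread would.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the array that vector_of_input.h would provide:
 * the bytes of the Ogg/Theora stream, compiled into the program.
 * Only 8 placeholder bytes are shown here. */
static const unsigned char vector_of_input[] = {
    'O', 'g', 'g', 'S', 0x00, 0x02, 0x00, 0x00
};
static size_t vector_pos = 0;   /* current read position in the vector */

/* Replacement for fread(buf, 1, nbytes, file) on a bare-metal target.
 * Copies up to nbytes from the vector and returns the number of bytes
 * copied (0 once the vector is exhausted, like fread at end of file). */
static size_t vector_read(void *buf, size_t nbytes)
{
    size_t left = sizeof(vector_of_input) - vector_pos;
    size_t n = nbytes < left ? nbytes : left;

    memcpy(buf, vector_of_input + vector_pos, n);
    vector_pos += n;
    return n;
}
```

A call such as fread(buffer, 1, 4096, infile) would then become vector_read(buffer, 4096), with the rest of the decoding loop unchanged.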
There was an error caused by a bug in the Ogg library:
IU in error mode (tt = 0x07)
400013a4 e8220011 st %l4, [%o0 + %l1]
Trap type 0x07 is a memory access to an unaligned address. Some architectures support unaligned stores, but SPARC does not (a 4-byte store must be on a 4-byte boundary). I was lucky to find a report from a group that put the Vorbis decoder on an FPGA. It was the master's thesis of two students: http://oggonachip.sourceforge.net/.
Then you just need to add some extra lines to the configure.in file of the Ogg library (/theora_hardware/libtheora-1.0alpha6/libogg-1.1.3/), as follows:
AC_CHECK_SIZEOF(short,2)
AC_CHECK_SIZEOF(int,4)
AC_CHECK_SIZEOF(long,4)
AC_CHECK_SIZEOF(long long,8)
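To make the unaligned-access trap more concrete, here is a small sketch of the difference between an aligned and an unaligned 32-bit store. It is not the fix applied to libogg (that fix is the configure.in change above); the helper names are made up for this illustration. It only shows why a store like "st %l4, [%o0 + %l1]" traps when the address is not a multiple of 4, and how code can avoid depending on alignment.

```c
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value at an arbitrary (possibly unaligned) byte address.
 * A direct "*(uint32_t *)p = v" can compile to a single st instruction,
 * which on SPARC traps (tt = 0x07) when p is not a multiple of 4.
 * memcpy makes no alignment assumption about p. */
static void store_u32_unaligned(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

/* Matching load from a possibly unaligned address. */
static uint32_t load_u32_unaligned(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Returns nonzero when p is safe for a direct 32-bit access. */
static int is_aligned4(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}
```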
You can run this script
# Export sparc-elf PATH
export PATH=/opt/sparc-elf-3.4.4/bin:$PATH
# Clean all
make distclean
cd libogg-1.1.3/
make distclean
# Set CROSS-Compiler and parameters
export CC=sparc-elf-gcc
export CXX=sparc-elf-gcc
export CFLAGS='-mv8 -msoft-float -static'
# -mv8 generate SPARC V8 mul/div instructions - needs hardware multiply and divide
# -msoft-float emulate floating-point - must be used if no FPU exists in the system
#Configure and install OGG lib
./configure --prefix=/theora_hardware/ --target=sparc-elf --host=sparc-elf --enable-static
make
make install
#Configure and make Theora for LEON (sparc)
cd ../
./configure --prefix=/theora_hardware/ --target=sparc-elf --host=sparc-elf --enable-static --disable-encode
make
After the last step you will have the binary "dump_video_hardware". In the first step (the LEON processor), synthesis generated a programming file (leon3mp.sof), which you can now use to program your FPGA. Then open the GRMON interface and load dump_video_hardware ("load dump_video_hardware"). Now, "run dump_video_hardware".
figure 3
LINUX support for LEON2 and LEON3 is provided through a special version of the SnapGear Embedded Linux distribution. SnapGear Linux is a full source package, containing kernel, libraries and application code for rapid development of embedded Linux systems.
Download the Snapgear:
ftp://gaisler.com/gaisler.com/linux/snapgear/snapgear-p33a.tar.bz2
Snapgear Manual:
ftp://gaisler.com/gaisler.com/linux/snapgear/snapgear-manual-1.33.0.pdf
Download the Sparc Linux Cross Compiler:
ftp://gaisler.com/gaisler.com/linux/snapgear/sparc-linux-1.0.0.tar.bz2
Kernel versions that I am using: linux-2.6.21.1 for MMU system
The tool-chain should be installed under /opt :
cd /opt
tar xjf /sparc-linux-1.0.0.tar.bz2
Add /opt/sparc-linux/bin to your PATH.
The SnapGear distribution can be installed anywhere:
tar -xjf snapgear-p33a.tar.bz2
General instructions on how to use SnapGear Linux are provided with the distribution.
After programming your FPGA with LEON3, you can start GRMON with this command:
./grmon-eval -altjtag -nb -abaud 38400 -nosram
The GRMON should be started with -nb to avoid going into break mode on a page-fault or data exception.
Problem with SRAM
I disabled the SRAM (-nosram) because my board has only 2 Mbit of SRAM, so I needed to load the kernel into SDRAM. With the SRAM enabled I was having memory mapping problems, so I decided to disable it.
Serial and jTAG Dbg Link.
The "-abaud 38400" option sets the application baud rate for UART 1.
In order to have a console interface to Linux you need to connect a serial cable to your computer. Then you can use a program like "kermit", which provides serial communication with the Linux console on the FPGA. Some FPGA boards have 2 serial connectors, so BE SURE that you are using the right one!
I am using the following kermit configuration:
set line /dev/ttyS0
define sz !sz \%0 > /dev/ttyS0 < /dev/ttyS0
set speed 38400
set carrier-watch off
set prefixing all
set parity none
set stop-bits 1
set modem none
set file type bin
set file name lit
set flow-control none
set prompt "Sparc Linux Kermit> "
c
Now load your kernel image (image.dsu) generated with SnapGear and watch the console come up in kermit.
figure 4
Now you can use the original dump_video.c, because you are running Linux and can therefore work with files.
# Export sparc-linux PATH
export PATH=/opt/sparc-linux/bin/:$PATH
# Clean all
make distclean
cd libogg-1.1.3/
make distclean
# Set CROSS-Compiler and parameters
export CC=sparc-linux-gcc
export CXX=sparc-linux-gcc
export CFLAGS='-msoft-float -fPIC -static'
# -msoft-float emulate floating-point - must be used if no FPU exists in the system
# -g generate debugging information - must be used for debugging with gdb
# -fPIC generate position independent machine code. It is necessary because we are using linux now.
# -static when linking an application static, all code used from libraries are included into the output binary
#Configure and install OGG lib
./configure --prefix=/homes_export/andre.lnc/theora/libtheora6_hard/ --target=sparc-linux --host=sparc-linux --enable-static
make
make install
#Configure and make Theora for LEON (sparc)
cd ../
./configure --prefix=/homes_export/andre.lnc/theora/libtheora6_hard/ --target=sparc-linux --host=sparc-linux --enable-static
make
After generating the binary for Linux on LEON3, copy it to /snapgear-p33/romfs/home/ and build a Linux kernel image that includes the compiled Theora program (dump_video). Don't forget to also copy a video to /snapgear-p33/romfs/home/. Watch the size of your Linux image: the SDRAM on your FPGA board must have room for it.
Then:
Program your board with LEON3;
Load the linux image on LEON3 (using grmon);
Open your kermit interface and set the configuration;
Run the linux kernel (using grmon);
Go back to kermit and you will see a Linux console;
Now, go to home (cd home) and run the dump_video (./dump_video video.ogg);
figure 5
AHB is a new generation of AMBA bus which is intended to address the requirements of high-performance synthesizable designs. It is a high-performance system bus that supports multiple bus masters and provides high-bandwidth operation.
AMBA AHB implements the features required for high-performance, high clock frequency systems.
The APB is part of the AMBA hierarchy of buses and is optimized for minimal power consumption and reduced interface complexity. The AMBA APB appears as a local secondary bus that is encapsulated as a single AHB slave device. APB provides a low-power extension to the system bus which builds on AHB signals directly.
The APB bridge appears as a slave module which handles the bus handshake and control signal retiming on behalf of the local peripheral bus.
You can see details:
http://www.gaisler.com/doc/amba.pdf
I searched theses and articles in order to decide where the best place for the Theora hardware would be, and how the software could communicate with the hardware over the bus and pass data to it. I found many different solutions.
The AHB is a high-speed bus suitable for connecting units with a high data rate. The problem is that the Theora hardware would be a master on the AHB bus and could overload the bus and reduce the performance of the LEON3. APB is slower than AHB. However, its protocol is simpler than AHB's and it does not disturb the communication between the LEON3 and the memory controller. I also found hybrid solutions using both APB and AHB, but I thought it better to plug the hardware into the APB bus only.
figure 6
How to include the Theora APB core
Create the directory grlib/lib/opencores/theora_hardware
Add "theora_hardware" to grlib/lib/opencores/dir.txt
Check out revision 13432 from SVN into grlib/lib/opencores/theora_hardware/:
http://svn.xiph.org/trunk/theora-fpga/
You will need to rename the entity syncram to tsyncram in these modules: Syncram, expand block, loopfilter, copyrecon, databuffer. This is because syncram is a name already used by a different component of the LEON3.
Now we need to create theora_hardware.vhd and theora_amba_interface.vhd:
theora_hardware.vhd and theora_amba_interface.vhd
Create grlib/lib/opencores/theora_hardware/vhdlsyn.txt and list all the VHDL files in it.
If you prefer, you can download these files here: theora_hardware1.tar
You should register the Theora Hardware APB/AMBA device (OPENCORES_THEORA_HARDWARE under VENDOR_OPENCORES) by changing the file devices.vhd (grlib/lib/grlib/amba/):
devices.vhd
Finally, we need to instantiate theora_hardware in leon3mp.vhd, taking care to use a free index of the APB slave output vector (apbo(i)):
leon3mp.vhd
Before synthesis ("make quartus"), run the commands "make distclean" and "make script" in your design directory (grlib/designs/leon3-altera-ep2s60-sdr/).
struct theora_regs_t {
volatile int flag_send_data;
volatile int data_transmitted;
volatile int flag_read_data;
volatile int data_received;
};
struct theora_regs_t * theora_regs = (struct theora_regs_t *)0x80000800;
flag_send_data (address 0x80000800): Flag the driver reads to know whether it can send data to the Theora Hardware.
data_transmitted (address 0x80000804): Data transmitted to the Theora Hardware.
flag_read_data (address 0x80000808): Flag the driver reads to know whether it can receive data from the Theora Hardware.
data_received (address 0x8000080C): Data received from the Theora Hardware.
sparc-elf-gcc -mv8 -msoft-float -g send_vector_of_input.c -o send_vector_of_input.exe
The Theora_amba_interface implements the APB/AMBA peripheral that receives and transmits data between the driver and Theora_hardware, using the addressing protocol defined above and the ReconRefFrames protocol.
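The register handshake described above could be used roughly as in the sketch below. This is not the code of send_vector_of_input.c: the function names are hypothetical, and it assumes that a nonzero flag means "ready", which the register description does not state explicitly. The functions take a pointer to the register block so that they can also be exercised against an ordinary struct in memory instead of the real address.

```c
#include <stddef.h>

/* Same layout as the register block above (four 32-bit words). */
struct theora_regs_t {
    volatile int flag_send_data;
    volatile int data_transmitted;
    volatile int flag_read_data;
    volatile int data_received;
};

/* Try to send one word to the hardware.
 * Returns 1 if the word was written, 0 if the hardware was not ready.
 * ASSUMPTION: a nonzero flag_send_data means "ready to accept data". */
static int theora_try_send(struct theora_regs_t *regs, int word)
{
    if (!regs->flag_send_data)
        return 0;
    regs->data_transmitted = word;
    return 1;
}

/* Try to receive one word from the hardware.
 * Returns 1 and stores the word in *word if data was available, else 0.
 * ASSUMPTION: a nonzero flag_read_data means "data available". */
static int theora_try_receive(struct theora_regs_t *regs, int *word)
{
    if (!regs->flag_read_data)
        return 0;
    *word = regs->data_received;
    return 1;
}
```

On the LEON3 without Linux, the pointer passed in would be (struct theora_regs_t *)0x80000800, and the caller would simply retry until the function returns 1.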
figure 7
A driver is necessary because we are now running Linux: a program running under Linux cannot write to a physical address directly, so it needs a driver.
There are many tutorials on the Internet about how to write a character device, so I will not go into those details. The parameters of the transaction between the software and the driver that I wrote are these:
struct _data
{
int read;
int wrote;
int data;
};
struct _data dt;
I/O control function: theora_ioctl(struct inode *inode, struct file *filp, unsigned int nFunc, unsigned long nParam)
If nFunc = '0', the driver will try to read from the Theora Hardware. If nFunc = '1', the driver will try to write to the Theora Hardware.
If the read succeeded, dt.read returns 1. If not, it returns 0.
If the write succeeded, dt.wrote returns 1, with the data in dt.data. If not, dt.wrote returns 0.
See the driver theora: theora.c
Put theora.c (the driver) in snapgear-p33/linux-2.6.21.1/drivers/char/
Add the line "obj-$(CONFIG_THEORA) += theora.o" to snapgear-p33/linux-2.6.21.1/drivers/char/Makefile. Like this Makefile
Include the lines ...
config THEORA
bool "Theora Driver"
default y
... on snapgear-p33/linux-2.6.21.1/drivers/char/Kconfig. Like this Kconfig
Make sure to select a major number that is not already assigned in snapgear-p33/linux-2.6.21.1/Documentation/devices.txt. In my case it was 121.
Then add the line "DEVICES = theora,c,121,0 \" to snapgear-p33/vendors/gaisler/leon3mmu/Makefile. Like this Makefile. It will create /dev/theora each time make is run.
Now, to generate the Linux image, you just need to run "make" in the snapgear-p33 directory.
When you boot the linux from FPGA you will see these lines:
Loading theora ...
LEON THEORA driver by Andre Costa (2007) - andre.lnc@gmail.com
- Unable to handle kernel paging request at virtual address 80000000:
The MMU protects certain memory regions. You either bypass the MMU using the SPARC-specific STA or LDA instructions (not recommended) or use ioremap to inform the MMU about the new area. In my case I used ioremap.
- Warning: ioremap: done with statics, switching to malloc
Error (running on FPGA): alloc_io_res(phys_80000800): cannot occupy
Halt
Halt
The problem is calling ioremap() repeatedly. You should call it once, keep the pointer it returns, and use that pointer to access the hardware in the rest of the code. I was calling ioremap in ioctl(), but it should be called in theora_init().
- BUG: soft lockup detected on CPU#0!
A soft lockup is when the kernel fails to reschedule for 10 seconds. This implies that your driver does not yield the CPU. For example, in your read/write functions you should either return immediately or sleep until woken up by an interrupt. You may not busy-wait. I was looping in the driver (until data arrived from theora_hardware), but the loop should be done only in the modified libtheora software.
You will need to edit dct_decode.c from libtheora. First, open the driver: pf = open("/dev/theora",O_RDONLY|O_WRONLY|O_TRUNC|O_CREAT); The function write_theoradriver(int pf, int data) that was implemented is responsible for sending a word to the driver. All the data must then be sent and received in the correct sequence. Take care with this: if even a single word is not sent or read, the whole decoding pipeline can stall. You can read the data back in order to compare the output.
dct_decode2.c
codec_internal.h (some little changes on this file)
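Because the driver must return immediately instead of busy-waiting, the retry loop has to live in user space. The sketch below shows what that loop could look like. It is not the code from dct_decode2.c: the callback type, the name write_word_retry, the max_tries limit, and the fake driver are all hypothetical and exist only so the logic can be tried without the real /dev/theora device. It also assumes the word to send is passed in dt.data.

```c
/* Transaction record exchanged with the driver (from the driver section). */
struct _data {
    int read;
    int wrote;
    int data;
};

#define THEORA_FUNC_READ  0   /* nFunc = 0: try to read one word  */
#define THEORA_FUNC_WRITE 1   /* nFunc = 1: try to write one word */

/* Hypothetical callback standing in for ioctl(pf, nFunc, &dt). */
typedef int (*theora_call_fn)(int pf, int nFunc, struct _data *dt);

/* Send one word, retrying in user space until the driver reports success.
 * Returns the number of attempts used, or -1 if max_tries was exhausted. */
static int write_word_retry(theora_call_fn call, int pf, int data,
                            int max_tries)
{
    struct _data dt;
    int tries;

    for (tries = 1; tries <= max_tries; tries++) {
        dt.read = 0;
        dt.wrote = 0;
        dt.data = data;
        call(pf, THEORA_FUNC_WRITE, &dt);   /* returns at once, never waits */
        if (dt.wrote == 1)
            return tries;
    }
    return -1;
}

/* Fake driver for illustration: accepts a word only on every third call. */
static int fake_calls = 0;
static int fake_last_word = 0;

static int fake_theora_call(int pf, int nFunc, struct _data *dt)
{
    (void)pf;
    fake_calls++;
    if (nFunc == THEORA_FUNC_WRITE && fake_calls % 3 == 0) {
        fake_last_word = dt->data;
        dt->wrote = 1;
    }
    return 0;
}
```

With the real device, the callback would wrap the ioctl call on the file descriptor returned by open("/dev/theora", ...).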
figure 8
The controller consists of a YUV to RGB converter and a video signal generator that sends the signal to a D/A converter.
It is a video D/A converter, and it is necessary because the Stratix II does not have one.
You should read the manual.
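As an illustration of what a YUV to RGB converter computes, here is a small software sketch. It is not taken from the video controller's VHDL: the coefficients are the commonly used ITU-R BT.601 ones for video-range input (Y from 16 to 235, U and V centred on 128), and the function and type names are assumptions made for this example. The hardware may well use fixed-point arithmetic and different rounding.

```c
/* Clamp a value to the 0..255 range of an 8-bit colour channel,
 * rounding to the nearest integer. */
static unsigned char clamp_u8(double x)
{
    if (x < 0.0)
        return 0;
    if (x > 255.0)
        return 255;
    return (unsigned char)(x + 0.5);
}

struct rgb {
    unsigned char r, g, b;
};

/* Convert one pixel from Y'CbCr (called YUV here) to RGB.
 * ASSUMPTION: BT.601 coefficients with a 16/128 offset. */
static struct rgb yuv_to_rgb(int y, int u, int v)
{
    double yy = 1.164 * (y - 16);
    struct rgb out;

    out.r = clamp_u8(yy + 1.596 * (v - 128));
    out.g = clamp_u8(yy - 0.813 * (v - 128) - 0.391 * (u - 128));
    out.b = clamp_u8(yy + 2.018 * (u - 128));
    return out;
}
```

With these coefficients an all-zero input (Y = U = V = 0) gives a green pixel, because the red and blue channels clamp to 0 while the green channel stays positive.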
figure 9
Leonardo Piga wrote a video controller and plugged it into the NIOS system. I then worked on plugging this video controller into my LEON-Theora integration, and I ran into some problems that I describe here.
dct_decode: The difference between this dct_decode.c and dct_decode2.c is that we no longer need to receive the outputs of reconrefframe and compare them with the software; we just need to send the pre-decoded data to reconrefframe. In addition, we need to send the height and the width, because the video controller will request them.
You can see my dct_decode: dct_decode.c and dump_video.c (now nothing is written to a file; the data are transmitted to theora_amba_interface)
Hierarchy of the modules: theora_hardware now contains both reconrefframe and the video controller. Some adaptations were necessary (theora_apb.vhd, theora_amba_interface.vhd, theora_hardware.vhd ...). Here you can download all these modules: theora_apb.tar
Pins of Lancelot: You will need to connect all the Lancelot pins to the LEON system. My new leon3mp is leon3mp.vhd, and my pin assignment file is leon3mp.qsf
I had some difficulty plugging it into the LEON system because of hardware constraints. The video controller runs at 25 MHz, but the LEON system runs at 50 MHz. A simple clock divider was not enough, because synthesis reported clock-domain-crossing problems during timing analysis: the video controller (25 MHz) needs to receive data from a 50 MHz module, and this was causing clock skew problems. The solution was simple. The PLL (phase-locked loop) of the LEON system is basically a closed-loop frequency control system that generates the LEON and SDRAM clocks with adjusted phase, so I added a new clock output there with the correct parameters. Like this, in /grlib/lib/techmap/clocks/clkgen_altera_mf.vhd
clkgen_altera_mf.vhd
dump_video adds a green band 8 pixels high below the video. If I play a 96x72 video, I get a 96x80 video. Something like:
Ogg logical stream 583c6ca0 is Theora 96x80 29.97 fps video
Encoded frame content is 96x72 with 0x0 offset
Theora encodes the frame in whole 16x16 macro blocks, so both the width and height must be a multiple of 16. When the actual video content is not a multiple of 16, it is expanded to one and a clipping rectangle is stored in the header (that's the "Encoded frame content..." message). dump_video does not crop the output down to the actual size of this rectangle, but outputs the entire expanded frame. The encoder by default stores zeros in this part of the frame, so that's why it looks green.
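The padding arithmetic can be checked with a few lines of C. The function name is made up for this example; it only rounds a dimension up to the next multiple of 16, which is how a 72-line picture becomes an 80-line coded frame while a width of 96 stays unchanged.

```c
/* Round a picture dimension up to the next multiple of 16
 * (the size of one macro block). */
static int pad_to_macroblock(int pixels)
{
    return (pixels + 15) & ~15;
}

/* Number of extra rows or columns added by the padding. */
static int padding_added(int pixels)
{
    return pad_to_macroblock(pixels) - pixels;
}
```

For the 96x72 example, pad_to_macroblock(96) is 96 and pad_to_macroblock(72) is 80, so padding_added(72) gives the 8 extra rows that show up as the green band.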
In the first tests there were small purple dots on the image; the video controller clock (25 MHz) needed a phase shift. We think that the video controller memory was being read at the same phase as it was being written.
Here you can find the ffmpeg2theora software, with which you can change the resolution, the start and end points, and several other useful things that you will certainly need when testing videos.
I made a demonstration of this integration, up to and including the video controller, and it is on YouTube. Click here to see the video
My current FPGA programmer file: leon3mp.sof
My current LINUX kernel image (with the theora driver and dump_video included and compiled): image.dsu
The picture size is small because the buffer multiplexed with an external memory (SRAM) was not implemented, so we can only use the few blocks of internal memory of the FPGA.
There is basically one problem in this presentation. On NIOS, the video was running very slowly, almost 7 times slower than real time. On my LEON system it is still slow, but only 5 times slower, so the performance is a little better than on NIOS. A 15-second video takes 75 seconds (5 times).
I discovered that the problem was LINUX! If I don't use Linux (as is done on NIOS), I can decode much faster than the playback time!
The problem is the LINUX system calls, because I am calling the driver for each word that I want to send. The solution would be to copy a block of words to the driver at once, but this is not implemented yet.
Without LINUX, a 30-second video encoded with the best quality (ffmpeg2theora -x 96 -y 80 -v 10 -e 30 original.ogg -o video.ogg) is decoded in 25 seconds. Below I will describe this other implementation.
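Since the block copy is not implemented, the following is only a sketch of the idea, with made-up names and an arbitrary block size of 64 words. Words are collected in a user-space buffer, and a single driver call (simulated here by a counter) hands over the whole block, so the number of system calls drops by roughly the block size.

```c
#include <stddef.h>

#define BLOCK_WORDS 64   /* hypothetical block size */

struct word_batcher {
    int buf[BLOCK_WORDS];
    size_t count;        /* words currently buffered */
    size_t flushes;      /* number of (simulated) driver calls made */
    size_t words_sent;   /* total words handed to the driver */
};

/* Stand-in for one system call that passes a whole block to the driver,
 * for example a write() on the device. Here it only updates counters. */
static void batcher_flush(struct word_batcher *b)
{
    if (b->count == 0)
        return;
    b->flushes++;
    b->words_sent += b->count;
    b->count = 0;
}

/* Queue one word; the block is flushed automatically when it is full. */
static void batcher_put(struct word_batcher *b, int word)
{
    b->buf[b->count++] = word;
    if (b->count == BLOCK_WORDS)
        batcher_flush(b);
}
```

Sending 1000 words this way takes 16 driver calls (15 full blocks plus one partial block flushed at the end) instead of 1000. The driver side would still have to feed the words to the hardware one by one through the register handshake.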
figure 10
NOT IMPLEMENTED
figure 11
[to complete]
[to complete]