Ticket #123 (assigned enhancement)

Opened 11 months ago

Last modified 10 months ago

Implement filters

Reported by: vreixo Owned by: vreixo
Priority: major Milestone: libisofs-0.6.4
Component: libisofs Version: libisofs-0.6.3
Keywords: Cc:

Description

The idea is to implement the concept of a Filter, i.e., the possibility to "filter" file contents before writing them to the image. This filtering process can consist of:

  • Cutting off some parts of the file
  • Transforming file contents: encoding, compression, encryption...

The idea is that a FilterStream or similar takes care of applying the Filter. The open question is whether implementing a filter means creating its own Stream (i.e. GzStream, EncryptStream...), or whether we can just provide a generic FilterStream that takes a reference to an IsoFilter interface, which is what each filter implements. In this second case, the Filter can be shared among several nodes:

	IsoFilter *filter = iso_filter_gz_create(...);
	iso_file_add_filter(file1, filter);
	iso_file_add_filter(file2, filter);
	...

i.e., the filter is a place where we can store configuration options for the Filter (encryption algorithm, key, ...). In this case, the FilterStream read function should be something like

    int filter_stream_read(Stream *s, buffer, ...)
    {
        FilterStreamData *data = s->data;
        Stream *source = data->source;
        Filter *f = data->filter;

        source->read(tmpbuffer);
        f->filter(tmpbuffer into buffer);
    }

However, it seems the filter->filter() function is not trivial to implement, as different filters may need different data chunks.
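
A minimal sketch of the generic FilterStream idea, assuming a filter that maps one input byte to one output byte. All names here (`MemStream`, `ByteFilter`, `filter_stream_read`, the letter-rotation transform) are hypothetical illustrations and not libisofs API. The sketch avoids the chunk-size problem only because the filter is byte-wise; a block-oriented filter such as gzip would need its own buffering.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory source stream (stands in for an IsoStream). */
typedef struct {
    const char *bytes;
    size_t len;
    size_t pos;
} MemStream;

/* Hypothetical shared filter: configuration plus a transform callback. */
typedef struct ByteFilter {
    int shift;   /* shared option, expected in 0..25 */
    void (*apply)(const struct ByteFilter *f, const char *in, char *out, size_t n);
} ByteFilter;

/* One filtered view of one source; many of these may share one filter. */
typedef struct {
    MemStream *source;
    const ByteFilter *filter;
} FilterStreamData;

static size_t mem_stream_read(MemStream *s, char *buf, size_t count)
{
    size_t left = s->len - s->pos;
    size_t n = count < left ? count : left;
    memcpy(buf, s->bytes + s->pos, n);
    s->pos += n;
    return n;
}

/* Rotate ASCII letters by f->shift positions; other bytes pass through. */
static void rot_apply(const ByteFilter *f, const char *in, char *out, size_t n)
{
    assert(f->shift >= 0 && f->shift < 26);
    for (size_t i = 0; i < n; i++) {
        char c = in[i];
        if (c >= 'a' && c <= 'z')
            c = (char)('a' + (c - 'a' + f->shift) % 26);
        else if (c >= 'A' && c <= 'Z')
            c = (char)('A' + (c - 'A' + f->shift) % 26);
        out[i] = c;
    }
}

/* Read from the source into a temporary buffer, then filter into buf. */
static size_t filter_stream_read(FilterStreamData *data, char *buf, size_t count)
{
    char tmp[64];
    size_t total = 0;
    while (total < count) {
        size_t want = count - total;
        if (want > sizeof tmp)
            want = sizeof tmp;
        size_t got = mem_stream_read(data->source, tmp, want);
        if (got == 0)
            break;
        data->filter->apply(data->filter, tmp, buf + total, got);
        total += got;
    }
    return total;
}

/* Demo: one filter object shared by two streams. Returns 0 on success. */
static int demo_shared_filter(void)
{
    ByteFilter rot13 = { 13, rot_apply };
    MemStream s1 = { "Hello", 5, 0 };
    MemStream s2 = { "image", 5, 0 };
    FilterStreamData f1 = { &s1, &rot13 };
    FilterStreamData f2 = { &s2, &rot13 };
    char out1[8] = {0}, out2[8] = {0};

    size_t n1 = filter_stream_read(&f1, out1, sizeof out1 - 1);
    size_t n2 = filter_stream_read(&f2, out2, sizeof out2 - 1);
    if (n1 != 5 || n2 != 5)
        return 1;
    return strcmp(out1, "Uryyb") == 0 && strcmp(out2, "vzntr") == 0 ? 0 : 1;
}
```

Calling `demo_shared_filter()` from a test program exercises both streams against the same `rot13` object, which is the sharing property the generic variant is meant to provide.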

Another solution is to implement each filter as an IsoStream implementation. In this case, each filter implements its own stream->read() function. This requires, however, an ugly API, as the user needs to create each "FilterStream" implementation. And, in the end, we need the Filter idea (i.e. a shared context) anyway.

i.e., we need something like:

	IsoFilter *filter1 = iso_filter_gz_create(...);
	iso_file_add_filter(file1, filter1);
	IsoFilter *filter2 = iso_filter_gz_create(...);
	iso_file_add_filter(file2, filter2);

and this only works if we extend the IsoStream interface to a FilterStream where we define the original_stream field. Otherwise we need something like:

	IsoStream *orig_stream = iso_file_get_stream(file1);
	IsoFilter *filter1 = iso_filter_gz_create(orig_stream, ...);
	iso_file_add_filter(file1, filter1);

or maybe directly, of course

	iso_file_add_gz_filter(file1, ....);

but we still have the problem that the shared context cannot be used.

A final alternative is to define a generic FilterContext:

    struct FilterContext {
        void *data; // filter-specific shared data
        IsoStream *(*get_filter)(IsoFile *);
    };

where get_filter is a factory method that creates the concrete Filter implementation for each file. The API usage would then be:

	FilterContext *filter = iso_filter_gz_create(...options...);
	//the filter->get_filter gets filled with a ptr to a filter-dependent function

	iso_file_add_filter(file, filter);
	// it calls the filter->get_filter() function to get the filter-dependent IsoStream.
	// the user does not need to know the concrete IsoStream implementation for each filter.
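
The factory approach could be sketched roughly as follows. `DemoStream`, `DemoFile`, `GzOptions` and `demo_file_add_filter` are hypothetical stand-ins, and passing the context itself to `get_filter` is an assumption added here so that the factory can reach the shared data; no real compression takes place.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for IsoStream and IsoFile. */
typedef struct DemoStream {
    struct DemoStream *source;   /* stream being filtered, NULL for a plain source */
    const void *shared;          /* points at the context's shared options */
} DemoStream;

typedef struct {
    DemoStream *stream;          /* current top of the stream chain */
} DemoFile;

/* Generic context: filter-specific shared data plus a factory method. */
typedef struct FilterContext {
    void *data;
    DemoStream *(*get_filter)(struct FilterContext *ctx, DemoFile *file);
} FilterContext;

/* Hypothetical gzip-like options. */
typedef struct {
    int level;
} GzOptions;

/* Factory: creates the per-file stream that wraps the file's current stream. */
static DemoStream *gz_get_filter(FilterContext *ctx, DemoFile *file)
{
    DemoStream *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->source = file->stream;
    s->shared = ctx->data;       /* every created stream sees the same options */
    return s;
}

/* Library-side helper: asks the context for a stream and installs it. */
static int demo_file_add_filter(DemoFile *file, FilterContext *ctx)
{
    assert(ctx != NULL && ctx->get_filter != NULL);
    DemoStream *s = ctx->get_filter(ctx, file);
    if (s == NULL)
        return -1;
    file->stream = s;
    return 0;
}

/* Demo: one context applied to two files. Returns 0 on success. */
static int demo_factory(void)
{
    GzOptions opts = { 9 };
    FilterContext gz = { &opts, gz_get_filter };
    DemoStream src1 = { NULL, NULL }, src2 = { NULL, NULL };
    DemoFile file1 = { &src1 }, file2 = { &src2 };

    if (demo_file_add_filter(&file1, &gz) != 0)
        return 1;
    if (demo_file_add_filter(&file2, &gz) != 0) {
        free(file1.stream);
        return 1;
    }

    int ok = file1.stream != file2.stream            /* one stream per file */
          && file1.stream->source == &src1
          && file2.stream->source == &src2
          && file1.stream->shared == file2.stream->shared;   /* shared context */
    free(file1.stream);
    free(file2.stream);
    return ok ? 0 : 1;
}
```

The caller only ever handles the `FilterContext`; the concrete stream type stays private to the filter that created it.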

Some considerations:

  • Given that a filter can change the file size, we would need to apply the filter twice: when the image structures are computed, and when the file is actually written. With complex filters, this can be a problem. Thus, all filters must have a property "on_the_fly" that decides whether the filter is applied each time the file must be read, or whether it is applied once and the result stored in a temporary folder. The user could decide whether to prioritize temporary hard disk space or computation time based on that flag. It is legal to ignore that flag (for example, for the cut-off filter it makes no sense to store a temporary file).
  • I wonder whether we should provide some kind of plugin system with filters. Ideas?
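
To illustrate the first consideration, here is a small sketch of the on-the-fly case, where a size-changing filter runs once to measure its output and once more to produce it. The run-length encoder and the names `rle_filter` and `demo_two_pass` are hypothetical examples chosen for brevity, not a proposed libisofs filter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical size-changing filter: naive run-length encoding.
 * Each run of up to 255 equal bytes becomes the pair (count, byte).
 * If out is NULL the output is only measured, not stored. */
static size_t rle_filter(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t written = 0;
    size_t i = 0;
    while (i < n) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255)
            run++;
        if (out != NULL) {
            out[written] = (unsigned char)run;
            out[written + 1] = in[i];
        }
        written += 2;
        i += run;
    }
    return written;
}

/* Demo: the size computed in pass 1 must match what pass 2 writes. */
static int demo_two_pass(void)
{
    const unsigned char src[] = "aaaabbbcc";
    size_t n = strlen((const char *)src);

    /* Pass 1: the image layout needs the filtered size. */
    size_t planned = rle_filter(src, n, NULL);

    /* Pass 2: the actual write. */
    unsigned char out[32];
    assert(planned <= sizeof out);
    size_t actual = rle_filter(src, n, out);

    const unsigned char expect[] = { 4, 'a', 3, 'b', 2, 'c' };
    return planned == actual && actual == sizeof expect
        && memcmp(out, expect, sizeof expect) == 0 ? 0 : 1;
}
```

For a cheap filter like this one, running it twice costs little. For an expensive transformation the second run is what the "on_the_fly" flag would let the user trade against temporary disk space.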

Change History

  Changed 11 months ago by pygi

  • milestone set to libisofs-0.6.4

  Changed 11 months ago by scdbackup

Finally my cut-out-node (see my comments to tickets 121 and 122).

The questions here are hard to decide. I propose you implement the cut-out node and a gzip compression as convenience functions in the libisofs API without yet exposing the underlying general mechanism. Both should yield special node types which encapsulate the filter entrails.

While getting the two proposed special filters to run you can explore advantages and drawbacks of filter implementations. We can also invent new useful special nodes and finally expose an API to build custom IsoFilterNode objects.

Too much preplanning now will only lead to delays in implementation and will nevertheless have to be partly revoked (with high probability). So better let's start to show some features and then abstract from them to a customizable re-implementation.

  Changed 11 months ago by scdbackup

All filter algorithms for now should yield the same number of bytes if applied to the same source object with the same parameters several times. Buffering data is a pain. We should leave it to the filter implementation to do it if its programmer has no better idea.

Changes to the source object's content and the resulting length changes are no new problem and are already handled well by the corresponding MISHAP events.

In particular, it must be harmless to read the content data an arbitrary number of times.

Filters put in question my wish about *_lseek() with IsoStream. I would possibly retract that random access wish in favor of a gzip filter. But I must be able to close and re-open the stream for multiple passes over the content.

  Changed 11 months ago by vreixo

  • status changed from new to assigned

The questions here are hard to decide. I propose you implement the cut-out node and a gzip compression as convenience functions in the libisofs API without yet exposing the underlying general mechanism.

I disagree. Filters are powerful. It would be great to expose them in an API, so that users can implement their own filters. There are lots of possible use cases for filters. We can't implement all of them. Let users do so. Of course, filters implemented inside libisofs will have a simple API.

custom IsoFilterNode objects.

What is this? A kind of IsoFileSource? A kind of IsoNode? A completely different thing?

So better let's start to show some features and then abstract from them to a customizable re-implementation.

Yes, why not? But the customizable re-implementation must be part of the 0.6.3 API, in my opinion.

All filter algorithms for now should yield the same number of bytes if applied to the same source object with the same parameters several times.

yes

Buffering data is a pain.

But useful. Suppose some complex processing, for example a filter that converts mp3 files to vorbis, or mpeg videos to xvid... a user may prefer to spend some MBs of temporary storage rather than applying the filter several times.

Implementation is not so hard. The first time a filter is applied, the contents are written to a temporary file. The following times, we read from that temporary file. A few extra lines of code, actually.
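
The temporary-file scheme could look roughly like the sketch below. `CachedFilter`, `expensive_filter` and `cached_filter_open` are hypothetical names, the upper-casing merely stands in for a costly conversion, and the standard C `tmpfile()` is used as the cache; a real implementation would also need a policy for where the temporary data lives.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cached filter: the filter runs once, later passes read the cache. */
typedef struct {
    const char *source;   /* unfiltered input (NUL-terminated for brevity) */
    FILE *cache;          /* NULL until the first pass has run */
    int filter_runs;      /* how often the expensive filter was applied */
} CachedFilter;

/* Stand-in for an expensive transformation: upper-case every byte. */
static int expensive_filter(CachedFilter *cf, FILE *dst)
{
    cf->filter_runs++;
    for (const char *p = cf->source; *p != '\0'; p++)
        if (fputc(toupper((unsigned char)*p), dst) == EOF)
            return -1;
    return 0;
}

/* Open for one pass: fill the cache on first use, then rewind it. */
static FILE *cached_filter_open(CachedFilter *cf)
{
    assert(cf != NULL);
    if (cf->cache == NULL) {
        cf->cache = tmpfile();
        if (cf->cache == NULL)
            return NULL;
        if (expensive_filter(cf, cf->cache) != 0) {
            fclose(cf->cache);
            cf->cache = NULL;
            return NULL;
        }
    }
    rewind(cf->cache);
    return cf->cache;
}

/* Demo: two passes over the content, one filter run. Returns 0 on success. */
static int demo_cached_filter(void)
{
    CachedFilter cf = { "libisofs", NULL, 0 };
    char pass1[16] = {0}, pass2[16] = {0};

    FILE *f = cached_filter_open(&cf);
    if (f == NULL)
        return 1;
    size_t n1 = fread(pass1, 1, sizeof pass1 - 1, f);

    f = cached_filter_open(&cf);
    if (f == NULL)
        return 1;
    size_t n2 = fread(pass2, 1, sizeof pass2 - 1, f);

    fclose(cf.cache);
    return n1 == 8 && n2 == 8 && strcmp(pass1, "LIBISOFS") == 0
        && strcmp(pass1, pass2) == 0 && cf.filter_runs == 1 ? 0 : 1;
}
```

Both passes deliver identical bytes while `filter_runs` stays at 1, which is the time-versus-space trade described above.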

For me the problem is time cost, and not changes in source content. That case can't be handled properly, it is just a MISHAP.

I would possibly retract that random access wish in favor of a gzip filter. But I must be able to close and re-open the stream for multiple passes over the content.

See my comment to ticket #121. IsoStream has an is_repeatable() function. If it returns != 0, you are able to read the content several times. If not, that is simply not possible (for example, when reading from pipes).
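
As a rough illustration of how a caller might honor such a check before a second pass: the struct layout, the function signature and `read_pass` below are assumptions made for this sketch and do not reproduce the actual IsoStream interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stream; only the property discussed above is modeled. */
typedef struct DemoStream {
    int (*is_repeatable)(struct DemoStream *s);   /* != 0: may be read several times */
    size_t size;                                   /* bytes delivered per pass */
    int passes;                                    /* completed passes so far */
} DemoStream;

static int always_repeatable(DemoStream *s) { (void)s; return 1; }
static int never_repeatable(DemoStream *s)  { (void)s; return 0; }

/* Pretend to read the whole stream once; returns bytes read or -1 on refusal. */
static long read_pass(DemoStream *s)
{
    assert(s != NULL && s->is_repeatable != NULL);
    if (s->passes > 0 && !s->is_repeatable(s))
        return -1;   /* e.g. a pipe: the data is gone after the first pass */
    s->passes++;
    return (long)s->size;
}

/* Demo: a file-like stream allows two passes, a pipe-like one only the first. */
static int demo_repeatable(void)
{
    DemoStream file_like = { always_repeatable, 100, 0 };
    DemoStream pipe_like = { never_repeatable, 100, 0 };

    int ok = read_pass(&file_like) == 100 && read_pass(&file_like) == 100
          && read_pass(&pipe_like) == 100 && read_pass(&pipe_like) == -1;
    return ok ? 0 : 1;
}
```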

  Changed 11 months ago by scdbackup

I can only re-iterate my particular wish for reading data via a loaded IsoImage and its IsoNodes. IsoFilesystem is not a solution for that because it only can deal with read-only filesystems (i.e. those which reside in a single blob of data which can be accessed by a single IsoFileSource).

If this can be done with filters soon - fine. If not, then I would like to see a solution based on other means.

Similar: the cut-out nodes are needed to give xorriso full backup grade capabilities. It needs to be able to handle oversized files on-the-fly.

So the architectural beauty of any filter mechanism is of subordinate priority to me. I would prefer if it could stay encapsulated for a while.

  Changed 11 months ago by vreixo

I can only re-iterate my particular wish for reading data via a loaded IsoImage

Reading data is not the purpose of Filters. Filters modify data. You will need to use IsoStream to read data.

  Changed 10 months ago by vreixo

I've taken a look at the cut-out filter. It can easily be implemented as a Filter. But I think filters are not the best way to handle it. Reason: if we want to extract the bytes from 2 GB to 4 GB of a big file, we are forced to read (and discard) the first 2 GB. Bad option. A filter is like a pipe where input data gets modified and output. That's the reason filters take a stream of bytes. They should work with any stream of bytes. However, the cut-out filter can take advantage of the random-access reads of regular files, and indeed is only useful in this case.
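
The difference between the two ways of reaching the interval can be sketched as follows. `cut_out_sequential` and `cut_out_seek` are hypothetical helpers operating on a plain `FILE *`; the `long` offsets of standard `fseek()` are used for brevity, so a real implementation for multi-gigabyte files would need a 64-bit offset type instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sequential variant: works on any byte stream but must read and discard
 * everything before the interval. Reports how many bytes were discarded. */
static size_t cut_out_sequential(FILE *in, long offset, size_t len,
                                 char *out, long *discarded)
{
    assert(offset >= 0);
    *discarded = 0;
    while (*discarded < offset) {
        if (fgetc(in) == EOF)
            return 0;
        (*discarded)++;
    }
    return fread(out, 1, len, in);
}

/* Random-access variant: jumps straight to the interval; needs seekable input. */
static size_t cut_out_seek(FILE *in, long offset, size_t len, char *out)
{
    assert(offset >= 0);
    if (fseek(in, offset, SEEK_SET) != 0)
        return 0;
    return fread(out, 1, len, in);
}

/* Demo: both variants yield the same bytes; only one had to discard input. */
static int demo_cut_out(void)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return 1;
    fputs("0123456789abcdef", f);

    char a[8] = {0}, b[8] = {0};
    long discarded = 0;

    rewind(f);
    size_t na = cut_out_sequential(f, 10, 4, a, &discarded);
    size_t nb = cut_out_seek(f, 10, 4, b);
    fclose(f);

    return na == 4 && nb == 4 && discarded == 10
        && strcmp(a, "abcd") == 0 && strcmp(b, "abcd") == 0 ? 0 : 1;
}
```

With a 2 GB offset the sequential variant would discard 2 GB of input on every pass, which is the cost objected to above.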

I think this should rather be created as a CutOutStream. I know filters are also IsoStreams, but the idea of a filter is also that it is created once (with its parameters) and then applied to existing IsoFiles. The cut-out "filter", however, fits better with the idea of a Builder. It is the builder (responsible for creating the IsoFile from the IsoSourceFile) that must take care of big files and cut them properly, using the CutOutStream instead of the usual IsoFileSourceStream.

Even better, we can modify the Builder interface to be able to return several IsoNodes from a single IsoFileSource (it has some other use cases, for example when auto-unpacking tars...)

We could wait until builders get implemented. However, given that files greater than 4 GB can't be written in any case, I propose to implement the splitting in the current default Builder and expose it as an option like "follow_symlinks":

	iso_tree_set_split_file_files(int split, off_t max_size, off_t split_size)

The idea is that if split is 1, files greater than max_size are split into chunks of split_size, i.e., the builder generates several IsoFiles for each such file. Problem: automatic name generation. We can choose between a simple name generation scheme that just appends a number at the end:

splited.file.1 splited.file.2 splited.file.3 ...

or let users specify it in a more complex way.
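
A sketch of the simple numbering scheme, under the assumption that each generated node is described by a byte interval plus a name. `SplitPart` and `plan_split` are hypothetical and only compute the plan; they do not create any nodes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical description of one generated node: a byte interval plus a name. */
typedef struct {
    long long offset;
    long long size;
    char name[64];
} SplitPart;

/* Plan the parts for a file of file_size bytes. Files not larger than max_size
 * stay whole. Returns the number of parts written to parts[], or -1 if
 * max_parts is too small or a generated name would not fit. */
static int plan_split(const char *name, long long file_size,
                      long long max_size, long long split_size,
                      SplitPart *parts, int max_parts)
{
    assert(split_size > 0 && file_size >= 0);
    int n;

    if (file_size <= max_size) {
        if (max_parts < 1)
            return -1;
        parts[0].offset = 0;
        parts[0].size = file_size;
        n = snprintf(parts[0].name, sizeof parts[0].name, "%s", name);
        if (n < 0 || n >= (int)sizeof parts[0].name)
            return -1;
        return 1;
    }

    int count = 0;
    for (long long off = 0; off < file_size; off += split_size) {
        if (count >= max_parts)
            return -1;
        long long left = file_size - off;
        parts[count].offset = off;
        parts[count].size = left < split_size ? left : split_size;
        /* Simple scheme: append ".1", ".2", ... to the original name. */
        n = snprintf(parts[count].name, sizeof parts[count].name,
                     "%s.%d", name, count + 1);
        if (n < 0 || n >= (int)sizeof parts[count].name)
            return -1;
        count++;
    }
    return count;
}

/* Demo: 10 units with limit 4 and chunk size 4 give chunks of 4, 4 and 2. */
static int demo_split(void)
{
    SplitPart parts[8];
    int n = plan_split("big.file", 10, 4, 4, parts, 8);
    return n == 3
        && strcmp(parts[0].name, "big.file.1") == 0
        && strcmp(parts[2].name, "big.file.3") == 0
        && parts[1].offset == 4 && parts[2].size == 2 ? 0 : 1;
}
```

A user-specified naming scheme could replace the `snprintf` call with a callback, at the cost of a more complex API.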

  Changed 10 months ago by scdbackup

I personally, in my role as developer of scdbackup, would need explicit splitting which creates one node out of a byte interval. scdbackup does not only split because of the 2GB limit but also because files do not fit on a single medium.

The automatic splitting appears interesting for large media where all resulting nodes of a large file can be stored.

So the specialized CutOutStream which relies on the inner capability to perform random access reading seems necessary. (From outside it appears as a stream, but inside it needs lseek() or similar functionality and can only operate on input objects which allow random access reading.)

This raises the next question: which example to implement for filters. The xor "encryption" is not really presentable.

Implement some real encryption? Or a gzip based compressor?

  Changed 10 months ago by vreixo

I've created a new branch (https://code.launchpad.net/~metalpain2002/libisofs/vreixo-filters) for filter implementations. Please take a look at it. I've added a GZip compression filter; you can test it with the small demo/isoz program. It creates an image and gzips the regular files in the root directory.

Note that this branch adds a dependency on zlib. I'm not sure whether adding new dependencies is a good idea. What do you think? Given that filters are an obvious candidate for adding dependencies, I wonder whether we should create a new library (libisofs-filters) where the filters are implemented. The other alternative, conditional compilation, would be a pain together with dynamic linking, given that some apps may need a library compiled with a given filter. I think an optional libisofs-filters library is the best solution for developers and packagers. Libisofs will define the filter interface and the functions to apply filters. The individual filter implementations will be part of libisofs-filters.

I do not propose a new project; the filters library would be a new folder inside the libisofs project.

Comments? Ideas?

  Changed 10 months ago by scdbackup

Please no mandatory dependencies to other libraries. xorriso would inherit them.

Try to find an architecture where such additional dependencies and capabilities can be added at the installation site.

Currently scdbackup suffers from some mess-up about libreadline on some systems. I am not sure whether it is the admin or the system or libreadline, but it demonstrates that even such a simple dependency as libreadline has its pitfalls.

  Changed 10 months ago by scdbackup

I am not sure whether this excerpt from the xorriso-standalone configure.ac would suffice:

dnl Check whether there is readline-devel and readline-runtime.
dnl If not, erase this macro which would enable use of readline(),add_history()
READLINE_DEF="-DXorriso_with_readlinE"
dnl The empty yes case obviously causes -lreadline to be linked
AC_CHECK_HEADER(readline/readline.h, AC_CHECK_LIB(readline, readline, , READLINE_DEF= ), READLINE_DEF= )
dnl The X= in the yes case prevents that -lreadline gets linked twice
AC_CHECK_HEADER(readline/history.h, AC_CHECK_LIB(readline, add_history, X= , READLINE_DEF= ), READLINE_DEF= )
AC_SUBST(READLINE_DEF)

The variable READLINE_DEF is then used in Makefile.am:

# No readline in the vanilla version because the necessary headers
# are in a separate readline-development package.
xorriso_xorriso_CFLAGS = -DXorriso_standalonE -DXorriso_with_maiN -DXorriso_with_regeX $(READLINE_DEF)

Inside xorriso.c there are #ifdef Xorriso_with_readlinE.

  Changed 10 months ago by vreixo

Replying to scdbackup:

The variable READLINE_DEF is then used in Makefile.am ... Inside xorriso.c there are #ifdef Xorriso_with_readlinE.

This is exactly what I don't want. Conditional compilation in a library is, in my opinion, a bad idea. For xorriso it is OK.

  Changed 10 months ago by scdbackup

Conditional compilation on a library is, in my opinion, a bad idea.

Then we need some other mechanism so that libisofs remains usable without a growing number of other libraries. The filter concept calls for such dependencies. We want them. But only if they can be fulfilled.

When we had a similar problem between libisofs and libburn, we invented libisoburn. But we can hardly introduce an extra library for every single filter.

  Changed 10 months ago by vreixo

I really think the best alternative is to provide a new library, libisofilters. This new library will hold all filters that will be implemented in the future, and of course libisofs does not depend on it.

Applications may decide whether to depend on it or not.

Note that if in the future we have hundreds of filters, it may even be a good idea to create several libraries.
