Monday, May 10, 2010

Lots and Lots of PostgreSQL Feature Requests

I was surprised and pleased to see that my last blog post, concerning possible directions for future PostgreSQL development, got about five times as many page views as my previous five posts put together, and a total of 70 comments (to date). This may be partly because it got syndicated on LWN, where a few more comments were also posted. I've gone through the comments posted on the blog itself and on the LWN article and counted up the number of times each feature was mentioned. Of course, this is totally unscientific, but it matches up fairly well to the results of previous surveys and gives me an excuse to talk about a bunch of interesting features.

1. Materialized Views (12). See my previous post on Materialized Views in PostgreSQL. Long story short, we may quite possibly get a simple version of this in PostgreSQL 9.1, but I suspect a lot more work will be needed to meet some of the use cases people have in mind.

2. Multi-master replication (6). This is more or less the holy grail of database geeks; it's a pretty hard problem. As it turns out, there is a project in the works called Postgres-XC which does just this. I hope to learn more about this project next week at PGcon. My understanding is that it currently supports only a subset of the SQL query types supported by PostgreSQL, but that work is underway to remove these limitations. Currently, none of the Postgres-XC code can be considered for including in core PostgreSQL because it uses a different license (LGPL), but it's still very interesting as an independent project.

3. Index-organized tables and/or index-only scans and/or automatic maintenance of CLUSTER order (6). I've grouped all of these features together because they're really driving toward the same underlying goal: reducing the I/O cost of an index scan. PostgreSQL will most likely not implement index-organized tables in the sense that Oracle has them, wherein, as I understand it, the table data is stored in the leaf pages of an index. However, we probably will implement index-only scans, which will allow us to gain some of the same performance benefits. Automatic maintenance of CLUSTER order would help, too, but I am not aware that anyone is currently working on that project.

4. MERGE (6). There is a (very ambitious) Google Summer of Code project to implement this feature. Stay tuned. If you're curious about what this feature actually does, I think Simon Riggs has written the best description of MERGE I've seen so far, together with some discussion of the implementation issues. A few weeks later he discussed he followed up with some further notes on the design of the feature.

5. Partitioning syntax (5). Itagaki Takahiro proposed a patch to implement this for PostgreSQL 9.0, but we simply ran out of time. I am hopeful that this will come back to life and be improved and eventually committed for PostgreSQL 9.1. The patch as proposed would have allowed the use of dedicated syntax to specify partitions and would have automatically created appropriate CHECK constraints for each partition, but would not have automatically routed INSERTs to the proper partition, which I think is necessary to make this really useful. Of course, others may have different opinions. :-)

6. Parallel query execution (5). This is a very good idea and yet also a very hard problem; I am not aware that anyone has even proposed a possible design for this, yet alone attempted to implement it. If we implement better SQL/MED support, it might be possible to get some of the benefits of this feature by spreading out data across multiple nodes but making it all appear local by creating remote tables. Or, it might be possible to leverage some of the I/O bandwidth of remote nodes by adding a feature to support non-local tablespaces (with some kind of specialized daemon process reading and writing remote pages on request). But neither of these are spot-on: what we really want is the ability to parallelize a query on a single node.

7. Procedures that can BEGIN, COMMIT, or ABORT transactions (4). This is another feature that would be great to have, but I am not aware that anyone is currently working on it.

8. Granular collation support (4). There are really two halves to this project. The SQL standard specifies a set of complex rules for determining which collation should be used for a particular comparison or ORDER BY operation. In PostgreSQL, you could imagine setting the collation for a particular setting using the "SET" command; associating a collation with a particular column; or overriding the collation for a particular instance of ORDER BY or a particular use of the < operator. So, one half of this problem is simply being able to recognize which collation applies in a particular context and doing the sort or comparison under that collation. The other half of the problem is extending our indexing system to handle multiple collations - either the ability to create an index with a particular collation (which can then be used to satisfy queries that pertain to that collation) or even the ability to create a single index which can somehow answer queries pertaining to multiple collations.

9. Better drivers (4). JDBC was mentioned twice, and asynchronous drivers were mentioned twice. It was also suggested that we should have an "official" Python driver. As a project, we've generally been wary about endorsing other projects, perhaps to our detriment. But good things seem to be happening with the psycopg2 project, especially the recent license change.

10. Graph database features (3). At least one person observed that you can do some of what is wanted here using common table expressions, also known as WITH queries, which are supported beginning in PostgreSQL 8.4. But several people seem to feel that we should have more graph support; one poster mentioned algorithms such as all-pairs shortest paths. I don't have a clear idea of what is needed here, but I suspect that some of the things people are looking for here could be implemented as PostgreSQL extensions. It would be interesting to see some more detailed requirements.

11. Documentation (3). I was surprised to see several requests for documentation among the comments, since I have generally found the PostgreSQL documentation to be superb and one of the great strengths of the project. But one poster did hit on an issue which I think is entirely legitimate: if you Google something like "PostgreSQL documentation", you get a link to our main documentation. But if you Google "PostgreSQL ALTER TABLE", you get a link to the documentation for ALTER TABLE in PostgreSQL 8.1, whereas you might hope to get a link to the 8.4 version of the documentation for that command. And if you Google "PostgreSQL setting", well, let's just say you don't get a link to a page that tells you how to change PostgreSQL settings. If you actually go to the documentation page and navigate through it manually, it's quite easy to find what you're looking for, but there must be something about our site that make Google fail to grok it properly.

Still another poster was looking for documentation in Chinese. Currently, I believe that we maintain documentation only in English due to the rather large translation effort that would be involved in keeping documentation up to date in multiple languages. In fact, we currently don't even ship Chinese translations of our error messages, due to the fact that our existing set of translations is too incomplete. If you would like to help localize PostgreSQL for your native language, please see our wiki page on NLS. Volunteers are needed!

5 comments:

  1. Funny that you should mention the Chinese translation. EnterpriseDB actually just sponsored the zh_CN translation, and Bin Wei submitted the .po files for 8.4 branch just last week! I hope to see them included in the next minor release. The backend .po file doesn't show up in the NLS status page yet because there was some errors in the first attempt, but Bin Wei submitted an updated file since which is waiting to be committed in the patch tracker.

    ReplyDelete
  2. Regarding drivers (I suggested the official Python driver).

    I think psycopg2 is great work. What I really mean by official drivers is:

    If I need drivers for $LANGUAGE I'd like to go to $language.postgresql.org -> download and find them.

    I'm not really talking about (taking Python as an example here since I know it best) a full blown DB-API 2.0 driver - if not appropriate. Rather something that provides ready to use python methods bases on what the recommendetions for libpq are. The DB-API 2.0 layer will popup eventually...

    I believe Federico Di Gregorio did great work, and certainly if there was a project in the python world to start an official driver he should be the one to oversee it.

    ReplyDelete
  3. patryk.kordylewskiMay 11, 2010 2:04 PM

    Maybe the problem with the old(er) manual pages in the google search index could be solved by adding a "canonical" tag and let it point to the "current" version of a manual page. That shouldn't be that hard to implement and could solve the problem (point the search engine to the most recent version of a page).

    ReplyDelete
  4. IMHO Merge, Partitions and Parallel queries are key for the large datawarehouse market.

    ReplyDelete
  5. "Automatic maintenance of CLUSTER order would help, too, but I am not aware that anyone is currently working on that project."

    We have found that we need to regularly run CLUSTER in order to keep our performance up-to-snuff, but since our platform is used 24/7, we need to manually pick a time when our platform is not being actively used to run CLUSTER... This is becoming a real headache for us... I'm surprised that others aren't putting this near the top of the list of requested improvements.

    ReplyDelete