Musings of an anonymous geek

May 16, 2007

Using TRUNCATE to empty a PostgreSQL database

Filed under: Database,Me stuff,Sysadmin,Technology — m0j0 @ 10:22 am

This is not something that’s any big hack or secret, but emptying a database of all data without dropping the structure along with it is one of those tasks that I do just often enough in my development work to be annoying. If you ask me, there should just be a big ol’ “EMPTY” statement you can apply to an entire database.

You *can* empty a database table in PostgreSQL using an unqualified DELETE statement, by the way – but it takes longer because it does a full scan of each table. TRUNCATE just nukes everything – and if you feed it the ‘CASCADE’ keyword, it’ll nuke everything in its path as well. This is nice, because I have a bunch of tables in my database, but I know that there are a relatively small collection of tables that everything else links to, so I can pass about 10 table names to TRUNCATE, and giving it the ‘CASCADE’ argument will wipe out about 2 dozen tables.

While I love writing code that creates stuff, writing code to do demolition is somehow amazingly satisfying as well.

Technorati Tags: , , , , ,

Social Bookmarks:

May 13, 2007

Freebase: Your database is ready!

Filed under: Big Ideas,Database,Me stuff,Technology — m0j0 @ 10:51 am

This is going to be really frickin’ cool. There’s just no other way to put it. Maybe I’m a little too much of a data geek, because I can’t seem to sit still since receiving my email letting me know that Freebase is now in alpha, and the account I requested months ago can now be activated. I logged and immediately started poking around. I’ve been doing that for about 48 hours straight now.

 What is ‘Freebase’?

Well, the short answer is that Freebase is a public domain relational database maintained by the community. If this sounds like Wikipedia, don’t get to attached to that comparison. It’s true that Wikipedia is also maintained by its users, but that’s where the similarities end. You see, while Wikipedia stores information in a way that makes it attractive and easy for humans to find things, Freebase provides the kind of structure and relational characteristics that make it useful to application developers (programmatic access). It provides a relational database, which is typically used by programs, instead of an encyclopedia, which is used by people.

If you’re a DBA, your first thought might be that these are people who are trying to take your job. Not so. This is in no way suitable for internal, private, corporate, proprietary data. In fact, I don’t believe it’s even allowed. What it *is* good for is applications that can make use of publicly available and/or publicly maintained data. For example, a sample application called “Concierge” allows users to browse restaurants in their area by first telling the application the area they live in, and then the type of restaurant they’re looking for. The data about the restaurants is all stored in the publicly maintained Freebase database, and Concierge also provides an interface for users to add new restaurants, which adds the data back to Freebase.

Me as a working example of a typical Freebase geek 

I myself am fleshing out the Freebase “beer”, “beer style”, and “beer style category” types, in the hopes that I can provide a web interface that allows beer enthusiasts and brewers alike to know more about beer, and helps them to develop recipes for their own beer. I’m using the BJCP (Beer Judge Certification Program) published beer style guidelines to flesh things out, and then other people can come in and start associating beers with styles and breweries and all kinds of other characteristics. Having the beer style definitions held in a public place means that, as the style guidelines evolve, people who care about such things are free to make updates to the data in Freebase. This means the data used by my application is always up to date, and I never have to push out updates to users just to update their local copy of this data.

It may also mean that I can afford to make the application available to large numbers of people for free, since I won’t have to find a hosting plan that lets me house however many gigabytes of data this thing grows to. I can eventually work in all of the properties of different hop varieties, different grain characteristics, yeast attenuation rates, water profiles of different beer brewing regions, etc. Further, I don’t have to be an expert on every facet of beer, because people who care about, say, yeast attenuation are probably going to populate that data for me anyway, whether its specifically for my application, or one of their own.

The Future of Freebase (Pre-“Total World Domination”) 

Aside from simple publicly maintained data, I think there are other implications of this model. For example, with all kinds of applications using Freebase as the back end database, and considering that users can have their own “private domain” data type definitions, it makes sense for applications to use the users’ Freebase credentials to maintain application preferences data using that domain. This would seem to make Freebase a contender for a de facto standard OpenID or CAS portal.

If the model is Earth-shaking, it stands in contrast to the current state of the actual user interface. Oh, I’m getting by just fine, but that’s more in spite of the interface than because of it. Some of the ajax-y features hurt as often as they help, navigation needs a little improvement, and there’s no way to add massive amounts of data quickly that I’ve found yet without writing code.

Further, it’s still unclear to me how they plan to foster cooperation between people who want to relate different data types. For example, I quite naturally want to list, for each beer, the brewery that makes the beer. Someone else has already created a “brewery” type, and the community has done a darn good job at fleshing out the data for that type. However, when you go and look at a brewery definition, there is no listing of the beers produced by that brewery. Freebase certainly supports the idea of a “reciprocal link” that would cause beers to show up under “brewery” entries as people add beer definitions and fill in the “brewery” property of the beer. However, there’s no clear rules on how to get this reciprocal link to happen if you’re not the creator of all of the types involved.

What’s more, I’m not the only person who has created a “beer” type. Which one should the “brewery” type administrator link to? Well, this wouldn’t be as big a problem if I were allowed to add a property to an existing beer type! Then I wouldn’t have to create a competing type at all! Currently, this is not allowed. I cannot go redefining the properties associated with a type that’s maintained by someone else. As a result, in order to support properties of beer that brewers and enthusiasts care about, I have to strike out on my own and hope that in the long run, my “beer” type becomes “the” beer type.

This should really all be opened up, and people should be allowed to add properties that are submitted for approval by the type administrator. Reciprocal links should be put in a “pending” state, or maybe even a “probationary” state. These are features that would encourage more interaction between users who care about the same data, and foster a community around the data that community cares about.

I’m sure there are plenty of other things to think about as well. For example, will Freebase let me upload the Briess malt profiles, published by one of the biggest maltsters in the US? Briess may have a problem with that – but how will Freebase know without receiving a cease and desist from some friendly neighborhood lawyers? Then there are technical and financial details. Presumably, they’ll either charge applications that use Freebase for commercial gain, or they’ll have to charge for some higher service level in order to guarantee that data will be available for applications to use.

This is not a simple service. I’ll say this though: I wish I could buy stock in Freebase, if only to cash out when they are inevitably purchased by Google.

Technorati Tags: , , , , , , , , ,

Social Bookmarks:

March 29, 2007

View Vindication

Filed under: Database,Technology — m0j0 @ 9:43 pm

So, about 9 months ago, I worked with a team of researchers. They were building a pretty hardcore global distributed system, and associated management infrastructure. My job was to simply advise them on issues revolving around how they use their database back end. For the most part, I just made suggestions here and there about normalizing their schema and helping them adjust a few queries to deal with the normalization. But there was this one thing….

I told them that if they had certain views of the data that were used all over the place by their applications, they might consider creating “views” in the database to provide that view of the data without denormalizing the data in the base tables. The team quickly caught on, and I left them to it. A few days later, it was clear that I had created a monster. One of the team members had created some views that were… unique. He was using them as a means of storing not only complex queries, but also some logic to manipulate the data, and format it for him. And, of course, since different functions might use the same basic data, just formatted or manipulated in a different way, there must’ve been lots of these views. And… well, I think you get the picture.

I pointed out that this was bad news on so many levels that to move in this direction was just No Bueno(tm) all over the place. There was like a week-long email thread involving most of the development team at the time, and in the end, I told them that if they still wanted to go this route, after everything I had told them about database design and usage in general, then at least they were making that decision with the knowledge of the possible consequences, and so my job, really, was done. I knew I had at least convinced one of the guys, and he fought tooth and nail with the other guy (they sat right next to eachother). I figured they’d come to some compromise and find a sane way to move forward. My time with them was over and I moved head first right into another project.

Since that time, there was a little turnover in the group. A couple of guys left, and a *few* new guys were brought on. Just a couple of weeks ago, I spoke to one of them about doing a database cluster to work around some performance issues. Then, I got an email from another guy saying they were trying to find the source of some database slowness. This is months after I left the group, so I never dreamed that this had anything to do with “the views”.

Then I got this IM from some nick I didn’t recognize saying “after all these months of fighting, we’ve finally proven that you were right all along”. The guy called me by name, but I didn’t know the nick, so I said “um. What?”. And he proceeded to tell me that, after removing the “views of death” that were put in place in spite of the many super-long email-based database lectures I had sent, their database performance woes were gone. The load on their database server went from “maxed out” to under 5% immediately. The load graph was astounding.

I had no idea, and in fact could not fathom, that these views would ever make it to production. Vindication is good.

For the record, it’s not views themselves that are bad. They’re *good*. In fact, I’m the one who suggested using views! It was the misuse of views that killed the database.

Technorati Tags: , , , , , ,

Social Bookmarks:

March 16, 2007

Trying to make friends with Python… again

Filed under: Database,Linux,Scripting,Sysadmin,Technology — m0j0 @ 6:42 am

I like the idea of Python. I have diverse interests, technically, and I like to think that there’s a language out there that I can use to write small script, a large website, a stored procedure, or a distributed system. The same language is used to write a very large chunk of systems code on Red Hat systems can also be used to make pretty graphical interfaces. I like that it’s cross platform.

My trouble with Python has been twofold: time, and support. I actually *have* read the introductory tutorial, but it was in 2002. I’ve forgotten just about all of it. I have a copy of the printed Python Reference Library, but it’s from 2000 (if memory serves). I own *both* editions of “Learning Python”, because by the time I got around to reading the first edition, the second edition made it completely obsolete. The other side of the time issue was making time to actually do something useful with the language so as to cement the fundamentals into my brain. That’s sometimes difficult when you’re a sysadmin and don’t really program for a living.

On the support side, I’ve had a lot of problems. Every time I go to do something with Python, I have no idea which route to take. There are so many frameworks and modules that have overlapping problem scopes that it’s hard for me to make a decision. What’s worse, nobody seems to know which module or framework is the canonical way of doing things. I guess things are still young enough to be schizophrenic. With Perl, when they say “there’s more than one way to do it”, that’s speaking more about the syntax of the language than the modules you might use (though it speaks to that, too, somewhat). With Python, the syntax is the (relatively) stable part – it’s choosing modules that can be a challenge.

Right now I’m building an XML-RPC server and a small test client. The client calls functions on the server, and in response, the server queries a PostgreSQL database and returns the results. I got a simple working prototype working with real data yesterday, but it took me a long time to figure out exactly which module should be used to talk to PostgreSQL from Python, and which module should be used for implementing the XML-RPC server. I’m comfortable with psycopg2 for the database calls, but I’m using SimpleXMLRPCServer for the server implementation, and I’m just waiting for one of its limitations to bite me. However, Twisted doesn’t seem like it’s quite soup yet in this particular area, and using xmlrpclib to implement a server seems silly with a ready made solution already built in (I know a project that does that, maybe because SimpleXMLRPC didn’t exist at the time they started?).

So, wish me luck. If you have any input on what you’ve done in this area with Python, fill me in! Also, if you’re an admin who uses Python and knows of a good reference site for simple day-to-day UNIX admin scripting in Python, let me know that too!

Technorati Tags: , , , , , , , , , , , , , , ,

Social Bookmarks:

February 19, 2007

How to replace a view, a cursor loop, and a few other queries with one SQL statement

Filed under: Database,Scripting,Technology — m0j0 @ 1:38 pm

I had a stored procedure that defined a cursor using a crazy query, and then a LOOP that did a FETCH and went row-by-row and performed updates and inserts to 3 or 4 different tables. I was having some trouble completing the very last thing the procedure needed to do. So I figured I’d take a little ‘learning break’ and do s’more research on what I was trying to accomplish.

I came across Roland Bouman’s blog, and read quite a few of the articles there. He’s a good presenter of his knowledge, and would appear to be incredibly knowledgeable about all things SQL-related, and something in those articles got me thinking that what I was trying to do may not be the greatest use of the DB server’s resources.

I had a cursor that fetched something like 350 rows for my loop to process. On average, probably fewer than 5 of those 350 rows would point to changes the procedure would need to make (either inserts or updates). Further, the cursor query depended on three views I had created, in part to work around shortcomings in an application’s database design, and in part to map external data to id’s defined in the application. This was quickly turning into a steaming heap of rotten garbage that I was going to have to provide for the care and feeding of. I had entered the No Bueno Zone™

I got s’more ideas from Roland’s blog, thought on this for the weekend, futzed around with some stuff, fell down a few times, and then, first thing this morning, I came in, sat down, stroked my chin for about 5 minutes, did some typing, and then sat back. I inspected the code. I looked at the output of a few other small queries. I tested the code, and immediately put my arms in the air signaling the touchdown that had just been achieved.

Here’s the final task I needed to accomplish in the stored procedure. Thanks again to Roland for sharing his knowledge on his blog. It was very useful.

UPDATE groups_members AS grpmem,
    (
        SELECT
            u.id AS uid,
            g.id AS newgid,
            m.groupid AS oldgid
        FROM
            bb_data_map AS b
            JOIN user AS u ON u.username = b.netid
            JOIN course AS c ON c.idnumber = b.coursename
            JOIN groups AS g ON g.name = b.precept AND g.courseid = c.id
            LEFT JOIN
                (
                    SELECT
                        gm.userid AS userid,
                        gm.groupid AS groupid,
                        g.name AS grpname,
                        c.idnumber AS coursename
                    FROM
                        groups_members AS gm,
                        groups AS g,
                        course AS c
                    WHERE g.id = gm.groupid
                        AND c.id = g.courseid
                 ) AS m ON m.userid = u.id AND m.coursename = b.coursename
        WHERE b.precept IS NOT NULL
        AND c.id IS NOT NULL
        AND m.groupid != g.id
    ) AS newgrp
    SET grpmem.groupid = newgrp.newgid
    WHERE (grpmem.userid = newgrp.uid AND grpmem.groupid = newgrp.oldgid)

Technorati Tags: , , , , , ,

Social Bookmarks:

Create a free website or blog at WordPress.com.