Isura's Blog: 2012

Thursday, August 30, 2012

Pentaho - Put a file with FTP

Following Kettle job was design to generate survey result file and statistic file. This job loop through several folders and export output into a csv file, then send it to a FTP site.

In FTP step, IP, port, username and password need to be give.

In file tab,a local directory is needed to be given instead of a file path or mapped directory. In wildcard area, file name or regular expression to pick up the right files needed to be given.

Friday, June 15, 2012

Reset all sequences' current values + PostgreSQL

Following function allows to reset starting values of all sequences in particular PostgreSQL schema. In data migration process, this function can be used to reset all sequence values after importing data to destination database.
CREATE OR REPLACE FUNCTION seq_update()
RETURNS void AS
$BODY$
Declare
tab1 varchar;
col1 varchar;
seqname1 varchar;
maxcolval integer;
ssql varchar;
BEGIN
FOR tab1, col1, seqname1 in Select distinct constraint_column_usage.table_name,
constraint_column_usage.column_name,
replace(replace(columns.column_default,'''::regclass)',''),'nextval(''','')
From information_schema.constraint_column_usage, information_schema.columns
where constraint_column_usage.table_schema ='public' AND
columns.table_schema = 'public'
AND columns.table_name=constraint_column_usage.table_name
AND constraint_column_usage.column_name = columns.column_name
AND columns.column_default is not null
AND constraint_column_usage.table_name not in ('user', 'usermodulespages')
--AND constraint_column_usage.table_name = 'role'
order by 1
LOOP
ssql := 'select max(' || col1 || ') from ' || tab1 ;
RAISE NOTICE 'SQL : %', ssql;
execute ssql into maxcolval;
RAISE NOTICE 'max value : %', maxcolval;
EXECUTE 'alter sequence ' || seqname1 ||' restart with ' || maxcolval;
END LOOP;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
Following command can be used to run the function
select * from seq_update()

Monday, June 11, 2012

SQL Query Optimization Techniques

1) SQL query became fast if you select actual columns instead of selecting all values (*)
2) HAVING clause is used to filter the rows after all the rows are selected. It is just like a filter. Do not use HAVING clause for any other purposes.
For Example: Write the query as
SELECT subject, count(subject) FROM student_details WHERE subject != 'Science' AND subject != 'Maths' GROUP BY subject;
Instead of:
SELECT subject, count(subject) FROM student_details GROUP BY subject HAVING subject!= Vancouver' AND subject!= 'Toronto';

3) Try to minimize the number of subquery block in your query.
For Example: Write the query as
SELECT name FROM employee WHERE (salary, age ) = (SELECT MAX (salary), MAX (age) FROM employee_details) AND dept = 'Electronics';
Instead of:
SELECT name FROM employee WHERE salary = (SELECT MAX(salary) FROM employee_details) AND age = (SELECT MAX(age) FROM employee_details) AND emp_dept = 'Electronics';

4) Use operator EXISTS, IN and table joins appropriately in your query.
a) Usually IN has the slowest performance.
b) IN is efficient when most of the filter criteria is in the sub-query.
c) EXISTS is efficient when most of the filter criteria is in the main query.
For Example: Write the query as
Select * from product p where EXISTS
(select * from order_items o where o.product_id = p.product_id)
Instead of:
Select * from product p where product_id IN (select product_id from order_items

5) Use EXISTS instead of DISTINCT when using joins which involves tables having one-to-many relationship. For Example: Write the query as
SELECT d.dept_id, d.dept FROM dept d
WHERE EXISTS ( SELECT 'X' FROM employee e WHERE e.dept = d.dept);
Instead of:
SELECT DISTINCT d.dept_id, d.dept FROM dept d,employee e WHERE e.dept = e.dept;

6) Use of WHERE clause.
Symbol operator: For Example: Write the query as
SELECT id, first_name, age FROM student_details WHERE age > 10;
Instead of:
SELECT id, first_name, age FROM student_details WHERE age != 10;

Wildcard vs sub-string: Write the query as:
SELECT id, first_name, age FROM student_details WHERE first_name LIKE 'Chan%';
Instead of:
SELECT id, first_name, age FROM student_details WHERE SUBSTR(first_name,1,3) = 'Cha';

Write the query as:
SELECT id, first_name, age FROM student_details WHERE first_name LIKE NVL ( :name, '%');
Instead of:

SELECT id, first_name, age FROM student_details WHERE first_name = NVL ( :name, first_name);

Write the query as:

SELECT id, first_name, age FROM student_details WHERE first_name LIKE NVL ( :name, '%');

Instead of:

SELECT id, first_name, age FROM student_details WHERE first_name = NVL ( :name, first_name);

Write the query as:

SELECT product_id, product_name FROM product

WHERE unit_price BETWEEN MAX(unit_price) and MIN(unit_price)

Instead of:

SELECT product_id, product_name FROM product WHERE unit_price >= MAX(unit_price)

and unit_price <= MIN(unit_price)

Write the query as:

SELECT id, name, salary FROM employee WHERE dept = 'Electronics' AND location = 'Bangalore';

Instead of:

SELECT id, name, salary FROM employee WHERE dept || location= 'Electronics Bangalore';

Use non-column expression on one side of the query because it will be processed earlier.

Write the query as:

SELECT id, name, salary FROM employee WHERE salary < 25000;

Instead of:

SELECT id, name, salary FROM employee WHERE salary + 10000 < 35000;

Write the query as:

SELECT id, first_name, age FROM student_details WHERE age > 10;

Instead of:

SELECT id, first_name, age FROM student_details WHERE age NOT = 10;

7) Utilize Union instead of OR
Indexes lose their speed advantage when using them in OR-situations
Write the query as:

SELECT * FROM TABLE WHERE COLUMN_A = 'value' OR COLUMN_B = 'value'

Instead of:

SELECT * FROM TABLE WHERE COLUMN_A = 'value'
UNION
SELECT * FROM TABLE WHERE COLUMN_B = 'value'

8) Use DECODE to avoid the scanning of same rows or joining the same table repetitively. DECODE can also be made used in place of GROUP BY or ORDER BY clause.

For Example: Write the query as

SELECT id FROM employee WHERE name LIKE 'Ramesh%' and location = 'Bangalore';

Instead of:

SELECT DECODE(location,'Bangalore', id, NULL) id FROM employee WHERE name LIKE 'Ramesh%';

10) To store large binary objects, first place them in the file system and add the file path in the database.

11) To write queries which provide efficient performance follow the general SQL standard rules.

a) Use single case for all SQL verbs

b) Begin all SQL verbs on a new line

c) Separate all words with a single space

d) Right or left aligning verbs within the initial SQL verb

Reference: http://beginner-sql-tutorial.com/sql-query-tuning.htm

http://hungred.com/useful-information/ways-optimize-sql-queries/

Tuesday, May 1, 2012

Manual Migration SQLSERVER to PostgreSQL

Following article explains a manual migration of a small SQL Server database structure to a postgres SQL
Create table scripts

1. Force to lower case (Notepad++ -> edit -> convert case to -> lower)

2. Remove all square brackets ( [ and ] ), owner prefixes (i.e. "dbo.").

3. Remove all non-supported keywords (i.e. "WITH NOCHECK", "WITH CHECK", "CLUSTERED", “ASC”, “with fillfactor = 90”, “ON PRIMARY”, “TEXTIMAGE_ON”)

4. Remove unsupported commands (entire rows) using following regular expression (Notepad++ for small scripts)

· ^.*//******.*$

· ^.*SET ANSI_NULLS.*$

· ^.*SET QUOTED_IDENTIFIER.*$

5. Replace the T-SQL batch terminator "GO" with the PostgreSQL batch terminator ";" and manually remove additional “;” OR remove all “GO” statements and add “;” for each sql statement.

6. Replace all non-supported data types

Nvarchar, text, ntext -- varchar
DATETIME, Smalldatetime -- TIMESTAMP
MONEY -- NUMERIC(19,4)
Bit (Boolean type) -- smallint or boolean
Smalldatetime -- TIMESTAMP
Float -- bigint
decimal -- numeric
Tinyint -- smallint
Int -- integer
Image -- bytea

7. uniqueidentifier data type

Operation	PostgreSQL	SQL Server
Create a table that has a UUID/GUID type automatically assigned	CREATE TABLE equipment( equip_id uuid PRIMARY KEY DEFAULT uuid_generate_v1() , equip_name varchar(255));	CREATE TABLE equipment( equip_id uniqueidentifier PRIMARY KEY DEFAULT NEWID() , equip_name varchar(255)); CREATE TABLE equipment( equip_id uniqueidentifier PRIMARY KEY ROWGUIDCOL , equip_name varchar(255) );
Create a table that has text representation of UUID/GUID as primary for easier transportation	CREATE TABLE equipment( equip_id char(32) PRIMARY KEY DEFAULT LOWER( REPLACE( CAST( uuid_generate_v1() As varchar(50)) , '-','') ) , equip_name varchar(255));	CREATE TABLE equipment( equip_id char(32) PRIMARY KEY DEFAULT LOWER( REPLACE( CAST(NEWID() As varchar(50)), '-','') ) , equip_name varchar(255));

8. Instead of int IDENTITY(1,1) for an auto incrementing column, Postgres uses sequences, the easiest way to transform this is to use the serial (or bigserial for 8 byte numbers) data type If you need to start the sequence at a specified seed or have it decrement or increment by more than 1, you must create the sequence manually.

CREATE SEQUENCE <> START <> INCREMENT <> ;
And then assign the default value for the column to use this sequence, using thenextval function:CREATE TABLE ( int default nextval('<>') not null;)

9. Find “DEFAULT (0)” value and add that manually at the end of relevant field in relevant table. Then remove “ALTER TABLE ……….. CHECK CONSTRAINT ……… DEFAULT (0) FOR …….” Rows

10. Remove “ALTER TABLE ……….. CHECK CONSTRAINT ………” rows

11.Change the index statements

SQL Server index

create nonclustered index ix_tblstudiesfieldstatus on studies (field_status_id );

Postgres Index

CREATE INDEX ix_tblstudiesfieldstatus ON ccadmin.studies USING tree( field_status_id ); ALTER TABLE ccadmin.studies CLUSTER ON times_idx;

12. Replace “getdate()” with “now()”

13. Replace “len” with “length”

Example

SQL Server Table

Postgre SQL Table

/****** Object: Table [dbo].[WEBINC_TRACK] Script Date: 05/03/2012 11:07:38 ******/

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[WEBINC_TRACK](

[SESSIONID] [uniqueidentifier] NOT NULL CONSTRAINT [DF_WEBINC_TRACK_SESSIONID] DEFAULT (newid()),

[ID] [int] IDENTITY(1,1) NOT NULL,

[TrackDate] [smalldatetime] NOT NULL CONSTRAINT [DF_WEBINC_TRACK_TrackDate] DEFAULT (getdate()),

[pathinfo] [varchar](1000) NULL,

[loginpathinfo] [varchar](1000) NULL,

[EMAILSENT] [tinyint] NULL,

CONSTRAINT [PK_WEBINC_TRACK] PRIMARY KEY CLUSTERED

([ID] ASC )WITH FILLFACTOR = 90 ON [PRIMARY]

) ON [PRIMARY]

Drop table if exists webinc_track;

create table webinc_track(

sessionid varchar(36) not null constraint df_webinc_track_sessionid default upper(cast(uuid_generate_v4() as varchar(36))),

id serial not null,

trackdate TIMESTAMP not null constraint df_webinc_track_trackdate default (now()),

puin varchar(255) null,

pathinfo varchar(1000) null,

loginpathinfo varchar(1000) null,

emailsent smallint null,

constraint pk_webinc_track primary key (id));

SQL Server Function

Postgre SQL function

/****** Object: UserDefinedFunction [dbo].[GetGenderFromSalutation] Script Date: 05/09/2012 16:52:18 ******/

SET ANSI_NULLS OFF

SET QUOTED_IDENTIFIER OFF

CREATE FUNCTION [dbo].[GetGenderFromSalutation](@SalutationID smallint)

RETURNS smallint AS

BEGIN

declare @result smallint

set @result = 0

if @SalutationID in (3,4,7,2)

set @result = 2

else if (@SalutationID = 1)

set @result = 1

return @result

END

CREATE OR REPLACE FUNCTION GetGenderFromSalutation(SalutationID smallint)

RETURNS smallint AS

$BODY$

DECLARE

result smallint := 0 ;

BEGIN

IF (SalutationID in (3,4,7,2)) THEN

result := 2;

ELSIF (SalutationID = 1) THEN

result := 1;

END IF;

RETURN result;

END;

$BODY$

LANGUAGE plpgsql VOLATILE

Procedures and functions

Apply same rules for procedures and function as applied in table creation scripts. Additionally follow following point.

1) Instead of “create function” use “Create or replace function”

2) Add “AS $BODY$ DECLARE” after returning data type at beginning

3) Remove individual declare phase from variable declaration and add “;” at end of each variable declaration statement.

4) Add “then” and “END IF;” word in all if conditional blocks. Further replace “else if” phase with “ELSIF”

5) Remove “SET” and “@” words

6) Add “$BODY$ LANGUAGE plpgsql VOLATILE” at the end of the function

7) PostgreSQL functions do not support output parameters, avoid them.

8) PostgreSQL functions do not support default values, all parameters must be passed to the function (although function overtyping is supported, allowing wrapper functions that allow defaults).

9) Use FOUND instead of @@FETCH_STATUS. Found is true if the last SELECT, SELECT INTO, or FETCH statement returned a value.

10) Instead of @@IDENTITY use currval(sequence name) by default the sequence name is __seq.

11) Remove DEALLOCATE cursor statements, the function is accomplished with CLOSE and DEALLOCATE has a different use.

12) Use PERFORM instead of EXEC to call stored procedures/functions.

13) Forvariable declarations, SELECT foo = foo column won't work. Use either SELECT INTO or the PL/pgsql definition := (as in foo := foocolumn.)

14) Error handling is different, when a PL/pgsql program encounters a nerror the execution is immediately stopped(can't check @@ERROR) or execution isnt to the EXCEPTION block (if it exists) after the current BEGIN block. Thus, if you want to return specific error values, you will need an EXCEPTION block.

Application level changes

1.Change "SELECT TOP 10 * FROM table" queries in to "SELECT * FROM table LIMIT 10" structure

2. LIKE statements are case sensitive in postgresql

3. concatination so "SELECT firstname + ' ' + lastname AS fullname" becomes "SELECT firstname || ' ' || lastname AS fullname"

Reference - http://www.postgresonline.com/journal/archives/179-Universal-Unique-Identifiers-PostgreSQL-SQL-Server-Compare.html+

http://wiki.postgresql.org/images/e/e4/5.pdf

Monday, March 26, 2012

Python basic tutorial

#genaral
C:\Python27\ python   #cmd prompt sart
import os # import a module
dir(os) # list all parameters in os module
os.listdir("C:\users\iss") # list * files in a folder
help(modulename)   #To get the docs on all the functions at once

import aa #import a method 1
aa.get_directory(aa.drive)
from aa import *   #import a method 2
get_directory(drive)

#----------------------------------------------------------------
# String Formatting
name = "John"
age = 23
print "%s is %d years old." % (name, age)

mylist = [1,2,3]
print "A list: %s" % mylist # List print

%s - String (or any object with a string representation, like numbers)
%d - Integers
%f - Floating point numbers
%. f - Floating point numbers with a fixed amount of digits to the right of the dot.
%x/%X - Integers in hex representation (lowercase/uppercase)

s = "Hey there! what should this string be?"
print "Length of s = %d" % len(s) # Length = 38
print "The first occurrence of the letter a = %d" % s.index("a") # First occurrence of "a" =13
print "a occurs %d times" % s.count("a") # Number of a's =1
# Slicing the string into bits
print "The first five characters are '%s'" % s[:5] # Start to 5
print "The next five characters are '%s'" % s[5:10] # 5 to 10
print "The twelfth character is '%s'" % s[12] # Just number 12
print "The last five characters are '%s'" % s[-5:] # 5th-from-last to end

print "String in uppercase: %s" % s.upper() # Convert everything to uppercase
print "String in lowercase: %s" % s.lower()   # Convert everything to lowercase

# Check how a string starts
s = "Str Hey there! what should this string be? some"
if s.startswith("Str"):
print "String starts with 'Str'. Good!"
# Check how a string ends
if s.endswith("ome!"):
print "String ends with 'ome!'. Good!"
# Split the string into three separate strings
print "Split the words of the string: %s" % s.split(" ")

#Basic Operators-------------------------------------------------
number = 1 + 2 * 3 / 4.0
remainder = 11 % 3
squared = 7 ** 2
cubed = 2 ** 3
helloworld = "hello" + " " + "world"
lotsofhellos = "hello" * 10 # multiplying strings to form a string
even_numbers = [2,4,6,8]
odd_numbers = [1,3,5,7]
all_numbers = odd_numbers + even_numbers #Lists can be joined
print [1,2,3] * 3 #repeating sequence

#Conditions------------------------------------------------------
x = 2
print x == 2 # prints True
print x == 3 # prints False
print x < 3 # prints True

name = "John"
age = 23
if name == "John" and age == 23:
print "Your name is John, and you are also 23 years old."
if name == "John" or name == "Rick":
print "Your name is either John or Rick."

#Loop------------------------------------------------------------
#The "for" loop - Prints out the numbers 0,1,2,3,4
for x in xrange(5):
print x
# Prints out 3,4,5
for x in xrange(3,6):
print x

#"while" loops - Prints out 0,1,2,3,4
count = 0
while count < 5:
print count
count += 1
#break is used to exit a for loop or a while loop, whereas continue is used to skip the
#current block, and return to the "for" or "while"
count = 0
while True:
print count
if count > 5:
break
# Prints out only odd numbers - 1,3,5,7,9
for x in xrange(10):
# Check if x is even
if x % 2 == 0:
continue
print x

#Function---------------------------------------------------------
def func2(x, y, z, m):
print x+y+z+m
func2(1,2,3,4)

#function within fuunction
def wrapper1(func, *args): # with star
func(*args)
def wrapper2(func, args): # without star ..positional args
func(*args)
def func2(x, y, z, m):
print x+y+z+m

wrapper1(func2, 1, 2, 3, 4)
wrapper2(func2, [1, 2, 3, 4])

#multiple function argument
def foo(xxx, yyy, zzz, *mmm):
print "First: %s" % xxx
print "Second: %s" % yyy
print "Third: %s" % zzz
print "And all the rest... %s" % mmm
foo(1,2,3,4)

#classes and objects-----------------------------------------------
class MyClass:
variable = "blah" # define variable withing functtion
def function(self): # define variable
print "This is a message inside the class."

myobjectx = MyClass() # assign class to an object
print myobjectx.variable # access variable within class
print myobjectx.function() # access function within class

myobjecty = MyClass()
myobjecty.variable = "yackity"
print myobjecty.variable

#Dictionaries-------------------------------------------------------
#A dictionary is a data type similar to arrays, but works with keys and values instead of
#indexes. Each value stored in a dictionary can be accessed using a key, which is any type of
#object (a string, a number, a list, etc.) instead of using its index to address it

phonebook = {}
phonebook["John"] = 938477566
phonebook["Jack"] = 938377264
or
phonebook = {
"John" : 938477566,
"Jack" : 938377264
}
#dictionary, unlike a list, does not keep the order of the values stored in it.
for name, number in phonebook.iteritems():
print "Phone number of %s is %d" % (name, number)

#To remove a specified index, use either one of the following notations:
del phonebook["John"] or
phonebook.remove("John")

#Genaraters----------------------------------------------------------
#Generators are simple functions which return an iterable set of items, one at a time, in a special way.
import random
def lottery():
# returns 6 numbers between 1 and 40
for i in xrange(6):
yield random.randint(1, 40)
# returns a 7th number between 1 and 15
yield random.randint(1,15)

for random_number in lottery():
print "And the next number is... %d!" % random_number

#List Comprehensions--------------------------------------------------
#creates a new list based on another list, in a single, readable line
sentence = "the quick brown fox jumps over the lazy dog"
words = sentence.split()
word_lengths = [len(word) for word in words if word != "the"]
print words
print word_lengths

#regular expression----------------------------------------------------
#Search and Replace: Some of the most important re methods that use regular expressions is sub.
#re.sub(pattern, repl, string, max=0)
#replace all occurrences of the RE pattern in string with repl, substituting all occurrences unless max provided.
#This method would return modified string.
phone = "2004-959-559 #This is Phone Number"
num = re.sub(r'#.', "", phone) # replace "This" = 2004-959-559 #is Phone Number
num = re.sub(r'#.*$', "", phone) # replace All after # mark with "" = 2004-959-559
num = re.sub(r'\D', "", phone) # replace anything other than digits with "" = 2004959559
print "Phone Num : ", num

#http://www.tutorialspoint.com/python/python_reg_expressions.htm
#Pattern Description
#-----------------------------------------------------------------------------------------------------
^ Matches beginning of line.
$ Matches end of line.
. Matches any single character except newline. Using m option allows it to match newline as well.
[...] Matches any single character in brackets.
[^...] Matches any single character not in brackets
re* Matches 0 or more occurrences of preceding expression.
re+ Matches 1 or more occurrence of preceding expression.
re? Matches 0 or 1 occurrence of preceding expression.
re{ n} Matches exactly n number of occurrences of preceding expression.
re{ n,} Matches n or more occurrences of preceding expression.
re{ n, m} Matches at least n and at most m occurrences of preceding expression.
a| b Matches either a or b.
(re) Groups regular expressions and remembers matched text.
(?imx) Temporarily toggles on i, m, or x options within a regular expression. If in parentheses, only that area is affected.
(?-imx) Temporarily toggles off i, m, or x options within a regular expression. If in parentheses, only that area is affected.
(?: re) Groups regular expressions without remembering matched text.
(?imx: re) Temporarily toggles on i, m, or x options within parentheses.
(?-imx: re) Temporarily toggles off i, m, or x options within parentheses.
(?#...) Comment.
(?= re) Specifies position using a pattern. Doesn't have a range.
(?! re) Specifies position using pattern negation. Doesn't have a range.
(?> re) Matches independent pattern without backtracking.
\w Matches word characters.
\W Matches nonword characters.
\s Matches whitespace. Equivalent to [\t\n\r\f].
\S Matches nonwhitespace.
\d Matches digits. Equivalent to [0-9].
\D Matches nondigits.
\A Matches beginning of string.
\Z Matches end of string. If a newline exists, it matches just before newline.
\z Matches end of string.
\G Matches point where last match finished.
\b Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
\B Matches nonword boundaries.
\n, \t, etc. Matches newlines, carriage returns, tabs, etc.
\1...\9 Matches nth grouped subexpression.
\10 Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.

File handler and line counts
f = open('received/20000402172012MARK0OUT.DAT', 'r')
s = open('received/20000402172012MARK0.NA', 'r')
num_lines = sum(1 for line in f)
num_liness= sum(1 for line in s)
print num_lines + num_liness

Pages

Thursday, August 30, 2012

Pentaho - Put a file with FTP

Friday, June 15, 2012

Reset all sequences' current values + PostgreSQL

Monday, June 11, 2012

SQL Query Optimization Techniques

Tuesday, May 1, 2012

Manual Migration SQLSERVER to PostgreSQL

Monday, March 26, 2012

Python basic tutorial