Pages

Thursday, August 30, 2012

Pentaho - Put a file with FTP

Following Kettle job was design to generate survey  result file and statistic file. This job loop through several folders and export output into a csv file, then send it to a FTP site.

 In FTP step, IP, port, username and password need to be give.
 In file tab,a local directory is needed to be given instead of a file path or mapped directory. In wildcard area, file name or regular expression to pick up the right files needed to be given.


Friday, June 15, 2012

Reset all sequences' current values + PostgreSQL

Following function allows to reset starting values of all sequences in particular PostgreSQL schema. In data migration process, this function can be used to reset all sequence values after importing data to destination database.
CREATE OR REPLACE FUNCTION seq_update()
  RETURNS void AS
$BODY$
Declare
tab1 varchar;
col1 varchar;
seqname1 varchar;
maxcolval integer;
ssql varchar;
BEGIN
FOR tab1, col1, seqname1 in Select distinct constraint_column_usage.table_name, 
constraint_column_usage.column_name,
replace(replace(columns.column_default,'''::regclass)',''),'nextval(''','')
 From information_schema.constraint_column_usage, information_schema.columns
 where constraint_column_usage.table_schema ='public' AND 
   columns.table_schema = 'public' 
    AND columns.table_name=constraint_column_usage.table_name
AND constraint_column_usage.column_name = columns.column_name
AND columns.column_default is not null 
AND constraint_column_usage.table_name not in ('user', 'usermodulespages')
--AND constraint_column_usage.table_name = 'role'
order by 1 
LOOP
ssql := 'select max(' || col1 || ') from ' || tab1 ;
RAISE NOTICE 'SQL : %', ssql;
execute ssql into maxcolval;
RAISE NOTICE 'max value : %', maxcolval;
EXECUTE 'alter sequence ' || seqname1 ||' restart  with ' || maxcolval;
END LOOP;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
Following command can be used to run the function
select * from seq_update()

Monday, June 11, 2012

SQL Query Optimization Techniques

1) SQL query became fast if you  select actual columns instead of selecting all values (*)
2) HAVING clause is used to filter the rows after all the rows are selected. It is just like a filter. Do not use HAVING clause for any other purposes. 
For Example: Write the query as
SELECT subject, count(subject) FROM student_details WHERE subject != 'Science' AND subject != 'Maths' GROUP BY subject;
Instead of:
SELECT subject, count(subject) FROM student_details GROUP BY subject HAVING subject!= Vancouver' AND subject!= 'Toronto';


3) Try to minimize the number of subquery block in your query. 
For Example: Write the query as
SELECT name FROM employee WHERE (salary, age ) = (SELECT MAX (salary), MAX (age) FROM employee_details) AND dept = 'Electronics'; 
Instead of:
SELECT name FROM employee WHERE salary = (SELECT MAX(salary) FROM employee_details) AND age = (SELECT MAX(age) FROM employee_details) AND emp_dept = 'Electronics';


4) Use operator EXISTS, IN and table joins appropriately in your query. 
    a) Usually IN has the slowest performance. 
    b) IN is efficient when most of the filter criteria is in the sub-query. 
    c) EXISTS is efficient when most of the filter criteria is in the main query.
For Example: Write the query as
Select * from product p where EXISTS 
(select * from order_items o where o.product_id = p.product_id)
Instead of:
Select * from product p where product_id IN (select product_id from order_items


5) Use EXISTS instead of DISTINCT when using joins which involves tables having one-to-many relationship. For Example: Write the query as
SELECT d.dept_id, d.dept FROM dept d 
WHERE EXISTS ( SELECT 'X' FROM employee e WHERE e.dept = d.dept);
Instead of:
SELECT DISTINCT d.dept_id, d.dept FROM dept d,employee e WHERE e.dept = e.dept;


6) Use of WHERE clause.
Symbol operator: For Example: Write the query as
SELECT id, first_name, age FROM student_details WHERE age > 10;
Instead of:
SELECT id, first_name, age FROM student_details WHERE age != 10; 


Wildcard vs sub-string: Write the query as:
SELECT id, first_name, age FROM student_details WHERE first_name LIKE 'Chan%';
Instead of:
SELECT id, first_name, age FROM student_details WHERE SUBSTR(first_name,1,3) = 'Cha';


Write the query as:
SELECT id, first_name, age FROM student_details WHERE first_name LIKE NVL ( :name, '%');
Instead of:
SELECT id, first_name, age FROM student_details WHERE first_name = NVL ( :name, first_name);

Write the query as:
SELECT id, first_name, age FROM student_details WHERE first_name LIKE NVL ( :name, '%');
Instead of:
SELECT id, first_name, age FROM student_details WHERE first_name = NVL ( :name, first_name); 

Write the query as:
SELECT product_id, product_name FROM product 
WHERE unit_price BETWEEN MAX(unit_price) and MIN(unit_price)
Instead of:
SELECT product_id, product_name FROM product WHERE unit_price >= MAX(unit_price) 
and unit_price <= MIN(unit_price) 

Write the query as:
SELECT id, name, salary FROM employee WHERE dept = 'Electronics' AND location = 'Bangalore';
Instead of:
SELECT id, name, salary FROM employee WHERE dept || location= 'Electronics Bangalore';

Use non-column expression on one side of the query because it will be processed earlier.
Write the query as:
SELECT id, name, salary FROM employee WHERE salary < 25000;
Instead of:
SELECT id, name, salary FROM employee WHERE salary + 10000 < 35000;

Write the query as:
SELECT id, first_name, age FROM student_details WHERE age > 10;
Instead of:
SELECT id, first_name, age FROM student_details WHERE age NOT = 10;



7) Utilize Union instead of OR
Indexes lose their speed advantage when using them in OR-situations
Write the query as:
SELECT * FROM TABLE WHERE COLUMN_A = 'value' OR COLUMN_B = 'value'
Instead of:
SELECT * FROM TABLE WHERE COLUMN_A = 'value'
UNION
SELECT * FROM TABLE WHERE COLUMN_B = 'value'
8) Use DECODE to avoid the scanning of same rows or joining the same table repetitively. DECODE can also be made used in place of GROUP BY or ORDER BY clause.
For Example: Write the query as
SELECT id FROM employee WHERE name LIKE 'Ramesh%' and location = 'Bangalore';
Instead of:
SELECT DECODE(location,'Bangalore', id, NULL) id FROM employee WHERE name LIKE 'Ramesh%';

10) To store large binary objects, first place them in the file system and add the file path in the database.
11) To write queries which provide efficient performance follow the general SQL standard rules.
     a) Use single case for all SQL verbs
     b) Begin all SQL verbs on a new line
     c) Separate all words with a single space 
     d) Right or left aligning verbs within the initial SQL verb

Tuesday, May 1, 2012

Manual Migration SQLSERVER to PostgreSQL

Following article explains a manual migration of a small SQL Server database structure to a postgres SQL
Create table scripts 
1.       Force to lower case (Notepad++  -> edit -> convert case to -> lower)
2.       Remove all square brackets ( [ and ] ),  owner prefixes (i.e. "dbo.").
3.       Remove all non-supported keywords (i.e. "WITH NOCHECK", "WITH CHECK",  "CLUSTERED", “ASC”, “with fillfactor = 90”, “ON PRIMARY”, “TEXTIMAGE_ON”)
4.       Remove unsupported commands (entire rows) using  following regular expression (Notepad++ for small scripts)
· ^.*//******.*$
· ^.*SET ANSI_NULLS.*$
· ^.*SET QUOTED_IDENTIFIER.*$

5.       Replace the T-SQL batch terminator "GO" with the PostgreSQL batch terminator ";" and manually remove additional “;”  OR remove all “GO” statements and add “;” for each sql statement.

6.       Replace all non-supported data types
  • Nvarchar, text, ntext -- varchar
  • DATETIME, Smalldatetime -- TIMESTAMP
  • MONEY -- NUMERIC(19,4)
  • Bit (Boolean type) -- smallint or boolean
  • Smalldatetime -- TIMESTAMP
  • Float -- bigint
  • decimal -- numeric
  • Tinyint -- smallint
  • Int -- integer
  • Image -- bytea
7.       uniqueidentifier data type

Operation
PostgreSQL
SQL Server
Create a table that has a UUID/GUID type automatically assigned
CREATE TABLE equipment( equip_id uuid PRIMARY KEY        
DEFAULT uuid_generate_v1() ,
equip_name  varchar(255));
CREATE TABLE equipment(
    equip_id uniqueidentifier
PRIMARY KEY
DEFAULT NEWID() ,
equip_name varchar(255));

CREATE TABLE equipment(
equip_id uniqueidentifier
PRIMARY KEY ROWGUIDCOL ,
equip_name varchar(255) );
Create a table that has text representation of UUID/GUID as primary for easier transportation
CREATE TABLE equipment(
equip_id char(32) PRIMARY KEY
DEFAULT LOWER(
REPLACE(  CAST(
uuid_generate_v1() As varchar(50))  , '-','')
 ) ,  equip_name varchar(255));
CREATE TABLE equipment(
    equip_id char(32) PRIMARY KEY
DEFAULT LOWER( REPLACE(
CAST(NEWID() As varchar(50)), '-','')
) , equip_name varchar(255));


8.   Instead of int IDENTITY(1,1) for an auto incrementing column, Postgres uses sequences, the easiest way to transform this is to use the serial (or bigserial for 8 byte numbers) data type If you need to start the sequence at a specified seed or have it decrement or increment by more than 1, you must create the sequence manually.
CREATE SEQUENCE <> START <>  INCREMENT <> ;
And then assign the default value for the column to use this sequence, using thenextval function:CREATE TABLE ( int default nextval('<>') not null;)
9. Find “DEFAULT (0)” value and add that manually at the end of relevant field in relevant table. Then remove “ALTER TABLE ……….. CHECK CONSTRAINT ……… DEFAULT (0) FOR …….” Rows
10. Remove “ALTER TABLE ……….. CHECK CONSTRAINT ………” rows
11.Change the index statements
SQL Server index
create nonclustered index ix_tblstudiesfieldstatus on studies (field_status_id );
Postgres Index
CREATE INDEX ix_tblstudiesfieldstatus  ON ccadmin.studies USING tree( field_status_id ); ALTER TABLE ccadmin.studies CLUSTER ON times_idx;
12.   Replace “getdate()” with “now()”
13.   Replace “len” with “length” 


Example

SQL Server Table
Postgre SQL Table
/****** Object:  Table [dbo].[WEBINC_TRACK]    Script Date: 05/03/2012 11:07:38 ******/
SET ANSI_NULLS ON GO
SET QUOTED_IDENTIFIER ON GO
SET ANSI_PADDING ON GO
CREATE TABLE [dbo].[WEBINC_TRACK](
[SESSIONID] [uniqueidentifier] NOT NULL CONSTRAINT  [DF_WEBINC_TRACK_SESSIONID]  DEFAULT (newid()),
[ID] [int] IDENTITY(1,1) NOT NULL,
[TrackDate] [smalldatetime] NOT NULL CONSTRAINT [DF_WEBINC_TRACK_TrackDate]  DEFAULT (getdate()),
 [pathinfo] [varchar](1000) NULL,
[loginpathinfo] [varchar](1000) NULL,
[EMAILSENT] [tinyint] NULL,
CONSTRAINT [PK_WEBINC_TRACK] PRIMARY KEY CLUSTERED
([ID] ASC )WITH FILLFACTOR = 90 ON [PRIMARY]
) ON [PRIMARY]
GO
Drop table if exists webinc_track;
create table webinc_track(
sessionid varchar(36) not null constraint df_webinc_track_sessionid default upper(cast(uuid_generate_v4() as varchar(36))),
id serial not null,
trackdate TIMESTAMP not null constraint df_webinc_track_trackdate  default (now()),
puin varchar(255) null,
pathinfo varchar(1000) null,
loginpathinfo varchar(1000) null,
emailsent smallint null,
 constraint pk_webinc_track primary key  (id));

SQL Server Function
Postgre SQL function
/****** Object:  UserDefinedFunction [dbo].[GetGenderFromSalutation]    Script Date: 05/09/2012 16:52:18 ******/
SET ANSI_NULLS OFF
GO
SET QUOTED_IDENTIFIER OFF
GO
CREATE FUNCTION  [dbo].[GetGenderFromSalutation](@SalutationID smallint)
RETURNS smallint  AS 
BEGIN
declare @result smallint
set @result = 0
if @SalutationID in (3,4,7,2) 
                set @result =  2
else if (@SalutationID = 1)
                set @result =  1
return @result
END
CREATE OR REPLACE FUNCTION GetGenderFromSalutation(SalutationID smallint)
RETURNS smallint  AS
$BODY$
DECLARE
                result smallint := 0 ;
BEGIN

IF (SalutationID in (3,4,7,2)) THEN
                result :=  2;
ELSIF (SalutationID = 1) THEN
                result :=  1;
END IF;
RETURN result;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
Procedures and functions
Apply same rules for procedures and function as applied in table creation scripts. Additionally follow following point.
1)      Instead of “create function” use “Create or replace function”
2)      Add “AS $BODY$ DECLARE” after returning data type at beginning
3)      Remove individual declare phase from variable declaration and add “;” at end of each variable declaration statement.
4)      Add “then”  and “END IF;” word in all if conditional blocks. Further replace “else if” phase with “ELSIF”
5)      Remove “SET” and “@” words
6)      Add “$BODY$  LANGUAGE plpgsql VOLATILE” at the end of the function

7) PostgreSQL functions do not support output parameters, avoid them.
8) PostgreSQL functions do not support default values, all parameters must be passed to the function (although function overtyping is supported, allowing wrapper functions that allow defaults).
9) Use FOUND instead of @@FETCH_STATUS. Found is true if the last SELECT, SELECT INTO, or FETCH statement returned a value.
10) Instead of @@IDENTITY use currval(sequence name) by default the sequence name is __seq.
11) Remove DEALLOCATE cursor statements, the function is accomplished with CLOSE and DEALLOCATE has a different use.
12) Use PERFORM instead of EXEC to call stored procedures/functions.
13) Forvariable declarations, SELECT foo = foo column won't work. Use either SELECT INTO or the PL/pgsql definition := (as in foo := foocolumn.)
14) Error handling is different, when a PL/pgsql program encounters a nerror the execution is immediately stopped(can't check @@ERROR) or execution isnt to the EXCEPTION block (if it exists) after the current BEGIN block. Thus, if you want to return specific error values, you will need an EXCEPTION block.

Application level changes
1.Change "SELECT TOP 10 * FROM table" queries in to "SELECT * FROM table LIMIT 10" structure
2. LIKE statements are case sensitive in postgresql
3. concatination so "SELECT firstname + ' ' + lastname AS fullname" becomes "SELECT firstname || ' ' || lastname AS fullname"

Monday, March 26, 2012

Python basic tutorial


#genaral
C:\Python27\ python         #cmd prompt sart
import os                          # import a module
dir(os)                              # list all parameters in os module
os.listdir("C:\users\iss")  # list * files in a folder
help(modulename)           #To get the docs on all the functions at once

import aa                          #import a method 1
aa.get_directory(aa.drive)
from aa import *               #import a method 2
get_directory(drive)

#----------------------------------------------------------------
# String Formatting
name = "John"
age = 23
print "%s is %d years old." % (name, age)

mylist = [1,2,3]
print "A list: %s" % mylist       # List print

%s  - String (or any object with a string representation, like numbers)
%d  - Integers
%f   - Floating point numbers
%. f - Floating point numbers with a fixed amount of digits to the right of the dot.
%x/%X - Integers in hex representation (lowercase/uppercase)

s = "Hey there! what should this string be?"
print "Length of s = %d" % len(s)                                                # Length = 38
print "The first occurrence of the letter a = %d" % s.index("a")  # First occurrence of "a" =13
print "a occurs %d times" % s.count("a")                                    # Number of a's =1
# Slicing the string into bits
print "The first five characters are '%s'" % s[:5]                # Start to 5
print "The next five characters are '%s'" % s[5:10]            # 5 to 10
print "The twelfth character is '%s'" % s[12]                     # Just number 12
print "The last five characters are '%s'" % s[-5:]               # 5th-from-last to end

print "String in uppercase: %s" % s.upper()                      # Convert everything to uppercase
print "String in lowercase: %s" % s.lower()                     # Convert everything to lowercase

# Check how a string starts
s = "Str Hey there! what should this string be? some"
if s.startswith("Str"):
    print "String starts with 'Str'. Good!"
# Check how a string ends
if s.endswith("ome!"):
    print "String ends with 'ome!'. Good!"
# Split the string into three separate strings
print "Split the words of the string: %s" % s.split(" ")

#Basic Operators-------------------------------------------------
number = 1 + 2 * 3 / 4.0
remainder = 11 % 3
squared = 7 ** 2
cubed = 2 ** 3
helloworld = "hello" + " " + "world"
lotsofhellos = "hello" * 10                               # multiplying strings to form a string
even_numbers = [2,4,6,8]
odd_numbers = [1,3,5,7]
all_numbers = odd_numbers + even_numbers #Lists can be joined
print [1,2,3] * 3                                                #repeating sequence

#Conditions------------------------------------------------------
x = 2
print x == 2  # prints True
print x == 3  # prints False
print x < 3   # prints True

name = "John"
age = 23
if name == "John" and age == 23:
    print "Your name is John, and you are also 23 years old."
if name == "John" or name == "Rick":
    print "Your name is either John or Rick."

#Loop------------------------------------------------------------
#The "for" loop - Prints out the numbers 0,1,2,3,4
for x in xrange(5):
    print x
# Prints out 3,4,5
for x in xrange(3,6):
    print x

#"while" loops - Prints out 0,1,2,3,4
count = 0
while count < 5:
    print count
    count += 1
#break is used to exit a for loop or a while loop, whereas continue is used to skip the
#current block, and return to the "for" or "while"
count = 0
while True:
    print count
    if count > 5:
        break
# Prints out only odd numbers - 1,3,5,7,9
for x in xrange(10):
    # Check if x is even
    if x % 2 == 0:
        continue
    print x

#Function---------------------------------------------------------
def func2(x, y, z, m):
    print x+y+z+m
func2(1,2,3,4)

#function within fuunction
def wrapper1(func, *args): # with star
    func(*args)
def wrapper2(func, args): # without star ..positional args
    func(*args)
def func2(x, y, z, m):
    print x+y+z+m

wrapper1(func2, 1, 2, 3, 4)
wrapper2(func2, [1, 2, 3, 4])

#multiple function argument
def foo(xxx, yyy, zzz, *mmm):
    print "First: %s" % xxx
    print "Second: %s" % yyy
    print "Third: %s" % zzz
    print "And all the rest... %s" % mmm
foo(1,2,3,4)

#classes and objects-----------------------------------------------
class MyClass:
      variable = "blah" # define variable withing functtion
      def function(self): # define variable
           print "This is a message inside the class."
       
myobjectx = MyClass()      # assign class to an object
print myobjectx.variable    # access variable within class
print myobjectx.function()  # access function within class

myobjecty = MyClass()
myobjecty.variable = "yackity"
print myobjecty.variable

#Dictionaries-------------------------------------------------------
#A dictionary is a data type similar to arrays, but works with keys and values instead of
#indexes. Each value stored in a dictionary can be accessed using a key, which is any type of
#object (a string, a number, a list, etc.) instead of using its index to address it

phonebook = {}
phonebook["John"] = 938477566
phonebook["Jack"] = 938377264
or
phonebook = {
    "John" : 938477566,
    "Jack" : 938377264
}
#dictionary, unlike a list, does not keep the order of the values stored in it.
for name, number in phonebook.iteritems():
    print "Phone number of %s is %d" % (name, number)

#To remove a specified index, use either one of the following notations:
del phonebook["John"] or
phonebook.remove("John")

#Genaraters----------------------------------------------------------
#Generators are simple functions which return an iterable set of items, one at a time, in a special way.
import random
def lottery():
    # returns 6 numbers between 1 and 40
    for i in xrange(6):
        yield random.randint(1, 40)
    # returns a 7th number between 1 and 15
    yield random.randint(1,15)

for random_number in lottery():
    print "And the next number is... %d!" % random_number

#List Comprehensions--------------------------------------------------
#creates a new list based on another list, in a single, readable line
sentence = "the quick brown fox jumps over the lazy dog"
words = sentence.split()
word_lengths = [len(word) for word in words if word != "the"]
print words
print word_lengths

#regular expression----------------------------------------------------
#Search and Replace: Some of the most important re methods that use regular expressions is sub.
#re.sub(pattern, repl, string, max=0)
#replace all occurrences of the RE pattern in string with repl, substituting all occurrences unless max provided.
#This method would return modified string.
phone = "2004-959-559          #This is Phone Number"
num = re.sub(r'#.', "", phone)  # replace "This"  = 2004-959-559 #is Phone Number
num = re.sub(r'#.*$', "", phone) # replace All after # mark with "" = 2004-959-559
num = re.sub(r'\D', "", phone)   # replace anything other than digits with ""  = 2004959559
print "Phone Num : ", num

#http://www.tutorialspoint.com/python/python_reg_expressions.htm
#Pattern Description
#-----------------------------------------------------------------------------------------------------
^ Matches beginning of line.
$ Matches end of line.
. Matches any single character except newline. Using m option allows it to match newline as well.
[...] Matches any single character in brackets.
[^...] Matches any single character not in brackets
re* Matches 0 or more occurrences of preceding expression.
re+ Matches 1 or more occurrence of preceding expression.
re? Matches 0 or 1 occurrence of preceding expression.
re{ n} Matches exactly n number of occurrences of preceding expression.
re{ n,} Matches n or more occurrences of preceding expression.
re{ n, m} Matches at least n and at most m occurrences of preceding expression.
a| b Matches either a or b.
(re) Groups regular expressions and remembers matched text.
(?imx) Temporarily toggles on i, m, or x options within a regular expression. If in parentheses, only that area is affected.
(?-imx) Temporarily toggles off i, m, or x options within a regular expression. If in parentheses, only that area is affected.
(?: re) Groups regular expressions without remembering matched text.
(?imx: re) Temporarily toggles on i, m, or x options within parentheses.
(?-imx: re) Temporarily toggles off i, m, or x options within parentheses.
(?#...) Comment.
(?= re) Specifies position using a pattern. Doesn't have a range.
(?! re) Specifies position using pattern negation. Doesn't have a range.
(?> re) Matches independent pattern without backtracking.
\w Matches word characters.
\W Matches nonword characters.
\s Matches whitespace. Equivalent to [\t\n\r\f].
\S Matches nonwhitespace.
\d Matches digits. Equivalent to [0-9].
\D Matches nondigits.
\A Matches beginning of string.
\Z Matches end of string. If a newline exists, it matches just before newline.
\z Matches end of string.
\G Matches point where last match finished.
\b Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
\B Matches nonword boundaries.
\n, \t, etc. Matches newlines, carriage returns, tabs, etc.
\1...\9 Matches nth grouped subexpression.
\10 Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.


File handler and line counts
f = open('received/20000402172012MARK0OUT.DAT', 'r')
s = open('received/20000402172012MARK0.NA', 'r')
num_lines = sum(1 for line in f)
num_liness= sum(1 for line in s)
print num_lines + num_liness