I tried using pandas.merge(how='outer'), but I was not sure which column to pass as the key, since there really isn't one, and the various combinations I tried did not work. It is possible that df1 or df2 has two (or more) identical rows. Now, set every column as the index: df1 = df1.set_index(df1.columns.tolist()), and similarly for df2. You can now compute df1.index.difference(df2.index) and df2.index.difference(df1.index); the two results are the rows that appear in only one of the dataframes.
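The index-based approach can be sketched as follows. The sample data is hypothetical (loosely modelled on a fruit/quantity table), and the conversion back to a dataframe with to_frame is an addition for readability, not part of the original answer. Note that Index.difference uses set semantics, so repeated identical rows collapse into one.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df1 = pd.DataFrame({"Fruits": ["apple", "banana", "coconut"], "Quantity": [1, 2, 3]})
df2 = pd.DataFrame({"Fruits": ["apple", "banana", "durian"], "Quantity": [1, 3, 4]})

# Turn every column into an index level so that each row becomes one index entry.
idx1 = df1.set_index(df1.columns.tolist()).index
idx2 = df2.set_index(df2.columns.tolist()).index

# Rows present in one dataframe but not in the other.
only_in_df1 = idx1.difference(idx2).to_frame(index=False)
only_in_df2 = idx2.difference(idx1).to_frame(index=False)

print(only_in_df1)
print(only_in_df2)
```

The two results together form the symmetric difference of the rows.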

So, to remove this transitive dependency, we need to divide the relation R. When dividing a relation, always place the candidate key and all the attributes that depend directly on it in the first relation. In the second relation, place the attribute that causes the transitive dependency together with the attributes that depend on it.
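As a rough illustration of this division, the sketch below uses pandas to split a relation with attributes A to F, where AB is the candidate key and F depends on D. The row values and the exact dependencies (AB -> CDE, D -> F) are assumptions made for the example. It then checks that joining the two parts on D reproduces the original rows.

```python
import pandas as pd

# Hypothetical rows for R(A, B, C, D, E, F); assumed dependencies: AB -> CDE and D -> F.
r = pd.DataFrame({
    "A": [1, 1, 2],
    "B": [1, 2, 1],
    "C": ["c1", "c2", "c3"],
    "D": ["d1", "d1", "d2"],
    "E": ["e1", "e2", "e3"],
    "F": ["f1", "f1", "f2"],
})

# First relation: the candidate key AB plus the attributes that depend directly on it.
r1 = r[["A", "B", "C", "D", "E"]].drop_duplicates()

# Second relation: the attribute causing the transitive dependency (D) and its dependent (F).
r2 = r[["D", "F"]].drop_duplicates()

# Joining on D should give back the original relation, i.e. no information is lost.
rejoined = r1.merge(r2, on="D")
print(r2)
print(rejoined)
```

In this sample, the pair (d1, f1) is stored once in r2 instead of twice in R, which is the redundancy the division removes.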

  What I want to do is basically get a "diff" of the two - where I get back all rows that are not shared between the two dataframes (not in the set intersection). Note, the two dataframes need not be the same length.

A table or a relation is considered to be in Third Normal Form only if it is already in 2NF and no non-prime attribute is transitively dependent on a candidate key of the relation.

diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
diff_df = diff_df.loc[diff_df['Exist'] != 'both']

You will have a dataframe of all rows that do not exist in both df1 and df2.
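A self-contained version of the merge approach might look like this. The sample data is hypothetical. With no on argument, pd.merge joins on all columns the two dataframes share, which here is every column.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df1 = pd.DataFrame({"Fruits": ["apple", "banana", "coconut"], "Quantity": [1, 2, 3]})
df2 = pd.DataFrame({"Fruits": ["apple", "banana", "durian"], "Quantity": [1, 3, 4]})

# The indicator column records where each row came from:
# 'left_only', 'right_only', or 'both'.
merged = pd.merge(df1, df2, how="outer", indicator="Exist")

# Keep only the rows that are not shared by the two dataframes.
diff_df = merged.loc[merged["Exist"] != "both"]
print(diff_df)
```

The Exist column also tells you which dataframe each remaining row belongs to, which the index-based approach does not report directly.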

Difference Between 3NF and BCNF - Tech Difference

So, before I address the process of normalizing a table into 3NF, allow me to discuss the candidate key. A Candidate Key is a minimal super key, i.e. a super key with the smallest set of attributes that can determine all attributes of a relation.

However, a transitive dependency is observed among the given functional dependencies, because attribute F is not directly dependent on the candidate key AB. Instead, F is transitively dependent on AB via attribute D: AB determines D, and D in turn determines F.

Now, the tables R1 and R2 are in BCNF. Relation R1 has two candidate keys, A and B; its non-trivial functional dependencies A -> BCD and B -> ACD satisfy BCNF because A and B are super keys of the relation. Relation R2 has D as its candidate key, and the functional dependency D -> F also satisfies BCNF because D is a super key.
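The BCNF check described here can be sketched with an attribute-closure helper. The function names closure and bcnf_violations are hypothetical, and the check only inspects the dependencies that are passed in. The last call, on the unsplit relation ABCDF, is an assumed counterexample added for contrast.

```python
def closure(attrs, fds):
    """Return every attribute determined by attrs under the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already covered, the right side follows.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """List the non-trivial dependencies X -> Y whose X is not a super key."""
    violations = []
    for lhs, rhs in fds:
        trivial = set(rhs) <= set(lhs)
        if not trivial and closure(lhs, fds) != set(attributes):
            violations.append((lhs, rhs))
    return violations


# R1(A, B, C, D) and R2(D, F) with the dependencies stated in the text.
print(bcnf_violations("ABCD", [("A", "BCD"), ("B", "ACD")]))  # no violations
print(bcnf_violations("DF", [("D", "F")]))                    # no violations

# Assumed counterexample: keeping everything in one relation leaves D -> F
# with a determinant that is not a super key.
print(bcnf_violations("ABCDF", [("A", "BCD"), ("B", "ACD"), ("D", "F")]))
```

An empty list means every listed dependency has a super key on its left side.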

df1 = pd.DataFrame([['apple', '1'], ['banana', 2], ['coconut',3]], columns=['Fruits','Quantity'])
df2 = pd.DataFrame([['apple', '1'], ['banana', 3], ['durian',4]], columns=['Fruits','Quantity'])
dict1 = diff_func(df1, df2, 'Fruits')

pd.concat([df1, df2]).loc[ df1.index.symmetric_difference(df2.index) ]

Observing the functional dependencies, we can conclude that AB is a candidate key for relation R, because using the key AB we can determine the values of all the attributes in R. So A and B are prime attributes, as together they make up the candidate key. The attributes C, D, E, and F are non-prime attributes because none of them is part of a candidate key.
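The reasoning above can be reproduced with a brute-force search for candidate keys. The dependency set AB -> CDE, D -> F is an assumption chosen to match the description (the original list of dependencies is not shown here), and the helper names are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute determined by attrs under the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Find all minimal attribute sets whose closure covers the whole relation."""
    keys = []
    # Try smaller sets first so that any superset of a found key can be skipped.
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            if any(set(key) <= set(combo) for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(combo, fds) == set(attributes):
                keys.append(combo)
    return keys


attributes = set("ABCDEF")
fds = [("AB", "CDE"), ("D", "F")]  # assumed dependencies for illustration

keys = candidate_keys(attributes, fds)
prime = set().union(*keys)       # attributes that appear in some candidate key
non_prime = attributes - prime   # everything else

print(keys)
print(sorted(prime), sorted(non_prime))
```

The search is exponential in the number of attributes, which is acceptable for textbook-sized relations like this one.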

The basic difference between 3NF and BCNF is that 3NF eliminates transitive dependencies from a relation, whereas for a table to be in BCNF, every non-trivial functional dependency X -> Y in the relation must have a super key X as its determinant.

By observing the relation R, we can say that A and BF are candidate keys of R, because each of them alone can determine the values of all attributes in R. So A, B, and F are the prime attributes, whereas C and D are non-prime attributes. No transitive dependency is observed in the functional dependencies above. Hence, the table R is in 3NF.

The basic difference between 3NF and BCNF is that 3NF eliminates transitive dependencies from a relation, whereas for a table to be in BCNF, X must be a super key in each non-trivial functional dependency X -> Y.

Normalization is a method that removes redundancy from a relation, thereby minimizing the insertion, deletion, and update anomalies that degrade the performance of databases. In this article, we will differentiate between two higher normal forms, namely 3NF and BCNF.

