Friday, June 12, 2009
Interesting link reveals the possible use of hidden botnets in networks.
Link
Thursday, June 11, 2009
Review: Design considerations for a network of Information
In this position paper, the authors insist that the Internet has to be reshaped around data instead of endpoints, in order to obtain a data-centric network, or network of information.
According to Jacobson [1], the first generation of the network dealt with connecting wires. The second one focused on end nodes hosting data, while the third generation should refocus on what humans care most about: information.
Their information model distinguishes between two main objects:
Data Objects (DO): the actual bit patterns that contain the information, such as a file, a phone call, a video, a web page, or a song (Beethoven's 9th symphony). These data objects can be divided into smaller pieces called "chunks" in order to simplify transfer.
Information Objects (IO): hold semantic information and metadata related to data objects, e.g. "Beethoven's 9th symphony is an MP3 file encoded at 128 kbps". IOs can be composed of other IOs or point directly to one or multiple DOs. An IO can represent the Eiffel Tower and point to DOs like pictures, a wiki page, or a service to buy tickets.
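To make the model concrete, here is a toy sketch (my own, not from the paper) of the two object types; the class and field names are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataObject:
    """A DO: the actual bit pattern, split into chunks to simplify transfer."""
    chunks: List[bytes]

@dataclass
class InformationObject:
    """An IO: semantic metadata plus pointers to DOs and/or other IOs."""
    description: str
    data_objects: List["DataObject"] = field(default_factory=list)
    sub_ios: List["InformationObject"] = field(default_factory=list)

# the paper's example: an IO describing a song, pointing at the MP3's chunks
mp3 = DataObject(chunks=[b"\x00" * 4, b"\x01" * 4])
symphony = InformationObject(
    description="Beethoven's 9th symphony, MP3, 128 kbps",
    data_objects=[mp3],
)
```

An "Eiffel Tower" IO would likewise hold pointers to picture DOs, a wiki-page DO, and perhaps a sub-IO for a ticket service.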
Versioning and Revocation:
Since some information changes frequently, such as newspapers, an IO can represent today's version; the next day, however, the IO should adapt dynamically by binding to another DO (web page), and likewise for the IO pointing to yesterday's news.
They suggest that objects invalidate themselves in order to preserve consistency: after a certain amount of time, an object must be recertified before it can be used. This technique maintains consistency despite disconnected operation, for example while other replicas are being updated. A DO can be deleted the same way, simply by never certifying it again.
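A minimal sketch of this self-invalidation idea (the TTL mechanics are my assumption; the paper does not give a concrete scheme):

```python
import time

class SelfInvalidatingObject:
    """Toy model of recertification: an object is usable only while a
    recent certification is in effect (the TTL value is an assumption)."""
    def __init__(self, payload, ttl_seconds=86400):
        self.payload = payload
        self.ttl = ttl_seconds
        self.certified_at = time.time()

    def is_valid(self):
        return time.time() - self.certified_at < self.ttl

    def recertify(self, new_payload=None):
        # rebinding, e.g. pointing today's "front page" IO at a new DO
        if new_payload is not None:
            self.payload = new_payload
        self.certified_at = time.time()

news = SelfInvalidatingObject("front-page-2009-06-11")
news.certified_at -= 100_000          # simulate an expired certification
assert not news.is_valid()            # must be recertified before use
news.recertify("front-page-2009-06-12")
assert news.is_valid()
```

An object that is never recertified simply expires, which is how deletion falls out of the same mechanism.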
Security considerations:
Security in today's architecture is based on trusting (via encryption keys) the host delivering the object. They propose to reshape security conventions so that we can handle secured data instead of secured tunnels.
Integrity and authenticity are directly tied to object names, which means there would be a cryptographic relation between the name and the object, as with self-certifying names. However, to enable off-line verification, DOs must carry private keys, which can be compromised. Another approach is to assign precomputed signatures to objects. This remains an open research question.
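The simplest form of name/object binding is a content hash: the name is derived from the object's bits, so anyone can verify integrity off-line without trusting the delivering host. A sketch (illustrative only; the paper leaves the exact scheme open):

```python
import hashlib

def self_certifying_name(data: bytes) -> str:
    """Name a DO by the SHA-256 digest of its content."""
    return hashlib.sha256(data).hexdigest()

def verify(name: str, data: bytes) -> bool:
    """Off-line integrity check: re-hash and compare against the name."""
    return self_certifying_name(data) == name

do = b"Beethoven's 9th symphony, MP3, 128 kbps"
name = self_certifying_name(do)
assert verify(name, do)
assert not verify(name, do + b" tampered")
```

This only works for static content; dynamic objects are exactly where the signature-based approaches mentioned above come in.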
Name resolution(NR):
Data objects are retrieved based on their unique identity (UID). NR starts by locating the object in the network; routing then forwards the object retrieval query to its storage location, and finally the DO is sent to the requesting client.
Name resolution resolves a UID into one or more locations and should work at both global and local scale, for example through cooperation between NR systems. They illustrate the side effects of adopting an ID/address split mechanism with the following example: if a laptop hosting numerous data objects moves, the location of all those data objects changes too, leading to a huge number of updates in the NR system.
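The update-storm problem can be illustrated with a toy NR table (the UIDs, host names, and table layout are all made up):

```python
# toy name-resolution table: UID -> set of current locations
nr = {f"uid-{i}": {"laptop-A"} for i in range(1000)}

def move_host(nr_table, old_loc, new_loc):
    """When a host moves, every DO it serves needs its NR entry rewritten."""
    updates = 0
    for locations in nr_table.values():
        if old_loc in locations:
            locations.discard(old_loc)
            locations.add(new_loc)
            updates += 1
    return updates

# one laptop moving forces one NR update per hosted DO
assert move_host(nr, "laptop-A", "cafe-B") == 1000
```

The cost of mobility scales with the number of hosted objects, not with the number of hosts, which is the scalability concern the authors raise.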
The NR system will be influenced by the characteristics of the namespaces. They would like to adopt flat names, which respect the absence of name ownership and the other characteristics identified in [2].
Of course, using flat names prevents the use of hierarchical namespaces and systems like DNS.
DHT-based solutions are promising since they are decentralized, scalable, and self-organizing, and need no central structure. However, a global DHT uses flat, hence non-hierarchical, names, which prevents cooperation with other systems.
Routing:
The number of addressable entities is still increasing and will reach millions, even billions, within a few years, with the emergence of sensor networks, the Internet of Things, growing data, etc. They claim that current routing research is not encouraging, according to [3] (to be reviewed later). Hence, they will investigate the efficiency of name-based routing, which integrates both the resolution and retrieval paths: it locates a DO by transforming its ID directly into a path, without going through an ID-to-address translation. Other techniques such as LLc and NodeID are also to be investigated (reviews to come).
Storage:
The information network can be implemented following two different models:
- Network-based storage model, where storage resources are provided by the network infrastructure, like dedicated storage servers.
- Network-managed storage model, where network nodes control portions of the storage of users connected to the network. Users can then decide which DOs go public, which are shared only with friends, etc.
Search:
Search systems are expected to go far beyond text-match search, toward semantic search or even search based on GPS position and location. For example, when a picture of the Eiffel Tower is taken, a search mechanism will identify the monument based on GPS or other techniques and point to related DOs such as a web page, its history, etc.
This position paper gives many ideas and anticipations about the future Internet architecture and reveals the weaknesses of the current addressing system. The authors distinguish between DOs and IOs and argue that a network of information needs a scalable naming system supported by an efficient routing system.
References:
1 - V. Jacobson, M. Mosko, D. Smetters, and J. Garcia-Luna-Aceves. Content-centric networking. Whitepaper, Palo Alto Research Center, Jan. 2007.
2 - M. Walfish, H. Balakrishnan, and S. Shenker. Untangling the web from DNS. In NSDI’04: Proc. 1st Symp. on Networked Systems Design and Implementation, San Francisco, CA, USA, 2004.
Link to the article
Tuesday, April 21, 2009
Review: SENS - a Scalable and Expressive Naming System using CAN Routing Algorithm
This paper proposes a Scalable and Expressive Naming System (SENS) based on the CAN routing algorithm. SENS is a descriptive naming scheme that uses a tuple of attribute/value pairs to name each resource; a resource such as a computer is named as (String OS = "Linux", String CPU-name = "Pentium 4", etc.). This information is stored at a large number of name servers (NS).
Their design claims to achieve scalable and efficient resource information distribution and retrieval. SENS handles exact queries (MEMORY = 512 MB) and multi-attribute range queries (MEMORY > 512 MB) with small overhead, as well as load balancing.
DESIGN OF SENS:
Mapping resource names to resource IDs:
A resource ID is a set of d coordinates of a point in the d-dimensional resource ID space (in the paper's example, d = 6). A resource name is mapped to a resource ID by assigning the hash value of each attribute/value (a/v) pair of the resource name to a coordinate of the resource ID. Name servers are responsible for sets of resource IDs, just as in the CAN system.
Ha is a function that uniformly hashes every attribute to a dimension between 1 and d, and Hv hashes every attribute value into the interval [1, 2^(m-1)], where m is the maximum size of a coordinate value in bits.
If multiple attributes in a resource name hash to the same value, Ha(attr_i) = Ha(attr_j), then the corresponding attribute values are mapped to multiple coordinate values in the same dimension. This means that such resource names are mapped to multiple resource IDs, which are distributed over the NSs.
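As I understand the mapping, it can be sketched like this (D, M, the concrete hash functions, and the collision handling are all simplified assumptions; on an Ha collision, SENS emits multiple resource IDs rather than overwriting a coordinate as this toy version does):

```python
import hashlib

D = 6    # dimensions of the resource ID space (the paper's example value)
M = 16   # bits per coordinate (illustrative choice)

def h_a(attr: str) -> int:
    """Ha: uniformly hash an attribute name to a dimension index in [0, D)."""
    return int(hashlib.sha256(attr.encode()).hexdigest(), 16) % D

def h_v(value) -> int:
    """Hv: hash an attribute value to a nonzero M-bit coordinate."""
    return int(hashlib.sha256(str(value).encode()).hexdigest(), 16) % (2 ** M - 1) + 1

def resource_id(name: dict) -> tuple:
    """Map a resource name (attribute -> value dict) to one resource ID.
    Unused dimensions stay zero, as in the paper."""
    coords = [0] * D
    for attr, val in name.items():
        coords[h_a(attr)] = h_v(val)   # simplification: collisions overwrite
    return tuple(coords)
```

The name server owning the CAN zone that contains `resource_id(name)` is then the one that stores the resource's information.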
Their mapping scheme is not injective: several resource names can be mapped to the same resource ID. Consequently, the resource ID alone is not a unique identifier; both the resource ID (resulting from the hash) and the resource name are required to uniquely identify a resource.
For numerical attribute values, a locality-preserving hash function is used, defined so that if val1 > val2 then H(val1) > H(val2). The main purpose of a locality-preserving hash is to handle range queries: it ensures that the resource IDs for a value range fall in an interval between a minimum and a maximum, which limits the number of NSs responsible for a query range.
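A simple way to realize such a function is linear scaling of the value's domain onto the coordinate range; the domain bounds and bit width below are assumptions, not values from the paper:

```python
def h_lp(value: int, lo: int = 0, hi: int = 4096, m: int = 16) -> int:
    """Locality-preserving hash for numeric attribute values: linearly scales
    the domain [lo, hi] onto the coordinate range [1, 2**m - 1], so that
    val1 > val2 implies h_lp(val1) > h_lp(val2)."""
    span = 2 ** m - 2
    return 1 + (value - lo) * span // (hi - lo)

# a range query on MEMORY in (512, 1024] only touches coordinates
# between h_lp(512) and h_lp(1024)
assert h_lp(512) < h_lp(1024)
```

Unlike a uniform hash, this concentrates a value range into a contiguous coordinate interval, which is exactly what bounds the set of name servers a range query must visit.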
If the number of attribute/value pairs is lower than d, the remaining coordinates of the resource ID are filled with zeros.
However, when the number of attribute/value pairs is higher than d, the set of attributes must be divided into multiple attribute sets, which correspond to multiple resource names (this fragmentation aspect is not treated in the paper).
Resource Information Distribution:
Since their scheme is CAN-based, zones are assigned to NSs; consequently, each NS manages resource information according to the resource ID.
A resource name may map to multiple resource IDs. If one NS is responsible for several resource IDs of the same resource, only one copy is maintained at that NS. When the resource IDs belong to different zones, i.e. different NSs, they use a multicast routing algorithm based on Spanning Binomial Trees (SBT). Their algorithm sends the minimum number of messages needed to deliver the information to the NSs.
The resource IDs corresponding to a resource name form a hypercube in the resource ID space, following the article "Optimum Broadcasting and Personalized Communication in the Hypercube". (The authors don't explain how the hypercube is built, so a later review will detail the construction algorithm.)
The registration message containing the resource's information is first delivered to the NS responsible for the resource ID built from the lowest value of each coordinate across the resource IDs (in the paper's example, the root node is (0,0,0)). This NS becomes the root node and forwards the message to its descendants according to the tree; those descendants in turn forward the registration message to their own descendants, and so on.
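This forwarding pattern is essentially a binomial-tree broadcast over a hypercube. Here is the standard construction rooted at node 0, with hypercube corners labeled by bit strings (my reconstruction, not necessarily the paper's exact algorithm):

```python
def sbt_children(node: int, k: int) -> list:
    """Children of `node` in the spanning binomial tree of a k-dimensional
    hypercube rooted at 0: set each zero bit above the highest set bit."""
    low = node.bit_length()   # first dimension this subtree may still expand
    return [node | (1 << i) for i in range(low, k)]

def broadcast_order(k: int) -> list:
    """Level-by-level delivery order of a registration message from root 0."""
    order, frontier = [0], [0]
    while frontier:
        nxt = []
        for n in frontier:
            nxt += sbt_children(n, k)
        order += nxt
        frontier = nxt
    return order
```

Each of the 2^k corners is reached exactly once, so 2^k - 1 messages suffice, which matches the paper's claim of minimal message count.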
Query resolution:
SENS supports:
- Exact queries: a query host sends a message to an NS, which maps the query resource name to resource IDs and selects the nearest destination resource ID. The message reaches its destination using the CAN routing algorithm. The NS responsible for the resource ID looks up its database to find the queried information and sends it back to the initial NS (the one first queried by the host).
- Range queries: a range query is bounded by the hash values of the upper and lower limits of the queried value ranges in each dimension. When a host sends a range query message to an NS, the latter maps the query range to a range query segment in the resource ID space. The query message is then broadcast to all NSs whose zones overlap the segment. They propose a broadcasting algorithm based on the SBT and the hypercube in order to reduce the number of messages sent. (Further details will be added once I read the article that treats hypercube and SBT formation.)
This article proposes a naming scheme more expressive than DNS, which offers only a value-limited resource name space without the possibility of range queries.
Other systems, such as the Intentional Naming System, use a descriptive name space also based on attribute/value pairs. Message routing for a name query is realized by looking up the query name in forwarding tables. The main limitation of such systems is scalability, since the forwarding tables grow with the number of resource names.
DHT-based routing protocols like Chord and CAN achieve scalable and efficient lookups by hashing a resource name to a DHT key. Range queries are the limitation of such systems: a query may spread over the whole DHT key space.
Other systems, like MAAN, do propose range queries: nodes are responsible for the attribute values of an attribute in the query range. The node responsible for the attribute/value pair (String OS = "Linux") must keep all information related to the Linux OS. The main limitation is load balancing, since popular attribute/value pairs may appear in resource names with high probability.
SENS is a naming system capable of handling resource information with exact and multi-attribute range queries. The authors propose a multicast/broadcast algorithm to deliver and retrieve information.
However, some issues remain unclear:
Several resource names might be mapped to the same resource ID, and unique identification relies on the resource ID plus the resource name. Is that enough to uniquely identify a resource? Is it possible to map two different resource names sharing the same attribute values to the same resource ID? In 2005, Xiaoyun Wang and Hongbo Yu deliberately produced a collision in the MD5 hashing algorithm (two different values hashed to the same ID).
Is SENS trying to merge search engines and DNS-like systems?
Do we really need to merge such systems? Is it faster? More reliable? More scalable?
Is mapping a resource to many resource IDs a good approach? Does having multiple IDs accelerate routing and finding the nearest ID? What if we mapped essential, important data to multiple IDs and restricted unimportant data to a single resource ID? How would we design a ranking system to classify data according to its importance?
If I upgrade my RAM from 512 MB to 1024 MB, how does SENS manage such updates?
Is the space partitioning efficient? When a node joins the system, a lot of information is handed over to it, so huge messages are transmitted to the new node.
Link to the article
Wednesday, April 15, 2009
Review: Toward a search architecture for software components
This paper proposes a design of a component search engine for Grid applications.
With the development of the component-based programming model, applications are increasingly formed dynamically from associations of components. Developers should be able to reuse already-developed components that match their needs. To do so, a component search engine seems essential.
The component search for Grid applications offers two facilities:
- Developers will be able to find the best component for their needs.
- The framework can replace a malfunctioning or slow component dynamically (at run time). The application should be able to decide on the best component to replace the malfunctioning one.
One of the related works:
The Agora component search engine supports the location and indexing of components as well as the search for and retrieval of a component. Agora automatically discovers sites containing software components by crawling the web (using Google's web crawler); when it finds a page containing an Applet tag, it downloads and indexes the related component. Agora supports JavaBeans and CORBA components. The database search is keyword-based and refined by users.
Workflows:
A workflow can be described as a process description of how tasks are done, by whom, in what order and how quickly.
Workflows are represented with low-level languages such as BPEL4WS, which requires too much user effort to describe even a simple workflow.
Higher-level languages and graphical user interfaces on top of BPEL4WS are being introduced that generate the BPEL code.
Their approach is workflow-based: components can be adapted and coordinated through workflows. Applications should be able to choose and bind to other components from different sources on the Grid. Such an application first searches its own local repository for components previously used or installed, and otherwise uses a search engine to find suitable components.
The application development process can be divided into 3 stages:
- Application sketching: developers specify (1) an abstract workflow plan describing the way information passes through the application's parts, and (2) place-holders describing the functions and operations to be carried out. This description helps produce a list of suitable components.
- Component discovery proceeds in two steps. First, the place-holder query is resolved by searching the local repository; if a suitable component is found locally, an identifier of this component is returned to the application. Second, if no component was found, a query session is started on remote sites; a list of ranked components is returned and refined by user specifications.
- Application assembly is the binding phase. Data or protocol conversions are often needed due to the heterogeneous inputs/outputs of components (string to array-of-doubles conversion, etc.).
- The Component Crawler is like a web crawler: it retrieves new components, updates the links (bindings) between components, and passes the results to the indexer.
- The Indexer builds the index data structure of GRIDLE. The characteristics and metadata associated with a component should be carefully selected for indexing, since this meta-information is what makes it possible to retrieve the suitable component. Such metadata can be: (1) functional information, like interfaces (published methods, names, signatures) and the runtime environment; (2) non-functional information, such as QoS and textual descriptions; (3) linking information to other components.
- The Query Analyzer resolves queries against the index; it uses a ranking module to retrieve the most relevant components, and the search is then refined by the user.
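The Indexer/Query Analyzer pair behaves much like a classic inverted index over component metadata. A toy sketch (the component records and field names are invented for illustration; GRIDLE's actual index and ranking are richer):

```python
# hypothetical component records: name, published methods, textual description
components = [
    {"name": "MatrixSolver", "methods": ["solve", "invert"],
     "desc": "dense linear algebra"},
    {"name": "FFTKernel", "methods": ["forward", "inverse"],
     "desc": "fast fourier transform"},
]

def build_index(comps):
    """Inverted index: term -> set of component indices mentioning it."""
    index = {}
    for i, c in enumerate(comps):
        terms = c["desc"].split() + [m.lower() for m in c["methods"]]
        terms.append(c["name"].lower())
        for term in terms:
            index.setdefault(term, set()).add(i)
    return index

def query(index, *terms):
    """Return components matching ALL query terms (no ranking here)."""
    sets = [index.get(t, set()) for t in terms]
    hits = set.intersection(*sets) if sets else set()
    return sorted(hits)
```

A real ranking module would then order the hits, e.g. by link-based rank over the binding graph, before the user refines the result.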
At this stage I don't have advanced knowledge of such systems and search engines, but I find this approach interesting, since the world of component development is emerging.
In the near future, thousands of components will be developed and ready to use. One of the main reasons for the wide adoption of the component-based programming model is the ability to reuse already-developed components and save time during development. A search engine seems necessary in order to find and locate suitable components.
Some issues in their approach remain unexplained or unclear:
- Components will be updated, deleted, and added, so how should the crawler iteration frequency be determined in order to keep the index up to date?
- The same question arises for component binding. Since the model is inspired by web pages, I think components are more dynamic when it comes to binding with other components: bindings will appear and disappear dynamically (at run time) when a component is replaced, so how is a component's ranking maintained? How often should the component ranking algorithm run?
- In their approach, the local repository is searched first for a suitable component. What if remote sites hold better-suited components with higher ranks than those already in the local repository? What policy should be used to keep the local repository up to date?
- The crawling module searches for new components; do we need to insert an agent on every repository?
- How are the heterogeneous aspects between components managed, e.g. COM and CORBA components?
- Semantic web and ontology use might simplify the mapping and querying, even though the designers of GRIDLE consider it a disadvantage because it requires a unified taxonomy.
Link to the article
PS: According to the ranking algorithm, the rank of the page hosting the article increases while the rank of my blog decreases; by linking to it, I am actually giving away a portion of my page's rank.
